1
00:00:00,000 --> 00:00:06,400
One of the biggest difficulties with unstructured data right now is that we have great tools for

2
00:00:07,200 --> 00:00:12,480
search and picking through, but that assumes you already know what you're looking for.

3
00:00:13,120 --> 00:00:19,440
And a lot of the time you don't know what you're looking for, especially if it's a brand new dataset

4
00:00:19,440 --> 00:00:24,960
and you have no idea what's in it. So it's the tools that allow you to step back and see the

5
00:00:24,960 --> 00:00:31,840
big picture of the whole dataset, or bring it into focus in a different way, that's bringing in

6
00:00:31,840 --> 00:00:37,600
that sort of coherent whole information rather than, you know, looking at the data through a straw

7
00:00:37,600 --> 00:00:44,160
or just search and return the most likely results to your search term that only answers the questions

8
00:00:44,160 --> 00:00:52,800
you already knew to ask. How did the best machine learning practitioners get involved in the field?

9
00:00:52,800 --> 00:01:00,960
What challenges have they faced? What has helped them flourish? Let's ask them. Welcome to Learning

10
00:01:00,960 --> 00:01:07,360
from Machine Learning. I'm your host Seth Levine. Hello and welcome to Learning from Machine Learning.

11
00:01:07,360 --> 00:01:13,840
On this episode we have a very special guest, Leland McInnes. He's a researcher, a mathematician at

12
00:01:13,840 --> 00:01:19,360
the Tutte Institute for Mathematics and Computing, a Canadian research institute. He's the maintainer

13
00:01:19,360 --> 00:01:26,480
for many machine learning packages, including UMAP, HDBSCAN, PyNNDescent, and DataMapPlot,

14
00:01:26,480 --> 00:01:31,840
which have created an ecosystem for unsupervised learning and have transformed the work that I'm

15
00:01:31,840 --> 00:01:37,680
doing. Leland, it is such a pleasure to have you here. Welcome to the show. Thanks, it's great to

16
00:01:37,680 --> 00:01:44,240
be here. So let's start off with this. What initially attracted you to the world of mathematics,

17
00:01:44,240 --> 00:01:50,800
computing and research? Well, to be honest, mathematics was something I grew up with. My

18
00:01:50,800 --> 00:01:56,240
father is a math professor or was a math professor at University of Canterbury in New Zealand where

19
00:01:56,240 --> 00:02:02,560
I grew up. By the time I was heading off to university, I wasn't sure if I was going to do

20
00:02:02,560 --> 00:02:10,240
math or not, but I ended up getting the opportunity to do math and physics, direct entry to second

21
00:02:10,240 --> 00:02:15,600
year. So I jumped in at that opportunity and it turned out I liked the math more than the physics

22
00:02:15,600 --> 00:02:21,600
and I had to quickly choose one or the other. Otherwise, my course load got overloaded. So

23
00:02:21,600 --> 00:02:28,480
ended up in math, stayed in pure math for a while there, but I actually took a break

24
00:02:28,480 --> 00:02:34,080
from academia. Went and worked in industry and with government for a couple years before heading

25
00:02:34,080 --> 00:02:39,600
back to do a PhD. So I have a little bit of experience with actually working with real data,

26
00:02:39,600 --> 00:02:44,960
as opposed to pure theoretical math, but then I just stayed with math for a long time after that.

27
00:02:45,760 --> 00:02:51,200
But working at the Tutte Institute, we have a bunch of different people working on

28
00:02:51,200 --> 00:02:56,480
slightly different problems and I ended up leaning more into the data science and machine

29
00:02:56,480 --> 00:03:03,120
learning questions despite my, to be honest, pure math background. Very cool. So what do you think

30
00:03:03,760 --> 00:03:07,440
having that like pure math theoretical background, how do you think that that

31
00:03:07,440 --> 00:03:12,000
influences your approach to some of these problems in the data science and machine learning world?

32
00:03:12,000 --> 00:03:18,320
I guess I'm coming at them from a different perspective. So my background was in topology

33
00:03:18,320 --> 00:03:24,960
and algebra. So I tend to think of things in terms of algebraic structures and the geometry of things.

34
00:03:24,960 --> 00:03:30,320
So I'm always thinking in terms of the geometry of the data. I don't know if that's super novel,

35
00:03:30,320 --> 00:03:36,240
but it's, I think, a different take than I've often encountered chatting with people.

36
00:03:36,240 --> 00:03:42,560
So I guess it gives you a unique view on how to think about data, how to explore data. And I guess

37
00:03:42,560 --> 00:03:47,520
if you think about the geometry of it, when we're doing something like unsupervised learning,

38
00:03:47,520 --> 00:03:50,800
you're trying to figure out the underlying structure of the data. So there's definitely

39
00:03:50,800 --> 00:03:54,880
something that goes hand in hand there. Do you want to talk about how you approach the

40
00:03:54,880 --> 00:03:59,200
unsupervised learning problem? Like when you get a new data set, what are you thinking about?

41
00:04:00,000 --> 00:04:06,080
So unsupervised learning, that's definitely where I live precisely for that question. It's like,

42
00:04:06,080 --> 00:04:11,840
well, what's the geometry? What's going on in this data set? Those are the questions that interest

43
00:04:11,840 --> 00:04:17,120
me personally, just trying to explore something new rather than the more supervised learning

44
00:04:17,120 --> 00:04:20,640
approach where, you know, I've got a whole bunch of answers and I just want to reproduce them. I'm

45
00:04:20,640 --> 00:04:28,400
sure that's interesting, but it's less my field. I just want to understand data sets in general,

46
00:04:28,400 --> 00:04:34,240
and I don't think there are easy ways to do it for the vast majority of data that's out there now.

47
00:04:34,240 --> 00:04:40,400
So there's a lot of great exploratory data analysis methods for tabular data. You know,

48
00:04:40,400 --> 00:04:43,680
if you've got a database or a spreadsheet with nicely formatted rows and columns,

49
00:04:45,040 --> 00:04:51,360
there's standard statistical approaches, faceted plotting. This is a well understood problem.

50
00:04:51,360 --> 00:04:57,760
But all the other data out there, which is to say the vast majority of data we collect, record,

51
00:04:57,760 --> 00:05:06,240
and store is not that. It's free form text, it's videos, it's audio files, it's system logs on

52
00:05:06,240 --> 00:05:13,440
computers, it's all kinds of things. And how do you explore that kind of data set? How do you

53
00:05:13,440 --> 00:05:20,800
understand what's going on in it? That's an interesting challenge. So that's kind of the

54
00:05:20,800 --> 00:05:26,400
sorts of problems I want to try and solve. Yeah, and the amount of unstructured data is just

55
00:05:26,400 --> 00:05:30,800
continuing to increase, you know, obviously exponentially. So yeah, there are different

56
00:05:30,800 --> 00:05:36,800
techniques and tools that you can use to take unstructured data and try to bring some structure

57
00:05:36,800 --> 00:05:42,000
to it. So I guess like each individual piece of data, and then also trying to understand how the

58
00:05:42,000 --> 00:05:48,400
data points, the interrelationships between those data points as well, and how you would group data,

59
00:05:48,400 --> 00:05:52,560
right, how you partition data, how you could explore the data, you could visualize the data,

60
00:05:52,560 --> 00:05:59,040
and that's really where your libraries come in. I don't even know where the best place you would

61
00:05:59,040 --> 00:06:04,880
want to start with, you know, in terms of clustering, dimensionality reduction features,

62
00:06:04,880 --> 00:06:08,800
what's the usual kind of pipeline, right, you have certain features for your data,

63
00:06:09,440 --> 00:06:12,800
then usually there's too many features, you definitely can't visualize it because there's

64
00:06:12,800 --> 00:06:18,160
going to be more than two, three, maybe four of your dimensions, then you're going to want to have

65
00:06:18,160 --> 00:06:22,800
it in a shape and form where you could do some clustering, then you're going to want to visualize

66
00:06:22,800 --> 00:06:26,880
it, you're going to want to represent that data. So yeah, tell me a little bit about, you know,

67
00:06:26,880 --> 00:06:32,080
the libraries that you have and how they address some of those problems. Sure, so the first step is

68
00:06:32,080 --> 00:06:40,160
just getting the data into a useful representation. And so that's turning this messy unstructured data

69
00:06:40,160 --> 00:06:47,040
into ideally something nice and mathematical and vectors seem to be the world we live in now.

70
00:06:47,040 --> 00:06:52,880
So you need a way to vectorize the data. Now, there's a lot of neural embedding techniques

71
00:06:52,880 --> 00:06:59,440
and they work super well. So I don't have too many solutions on that front. But if you have other

72
00:06:59,440 --> 00:07:05,360
kinds of data, we have a library called vectorizers that tries to deal with some of the ways of getting

73
00:07:05,360 --> 00:07:12,640
other kinds of unstructured data into vector formats. So whether you're using fancy neural

74
00:07:12,640 --> 00:07:17,680
embedding techniques or something else, if you can get it to a vector, now you have something you

75
00:07:17,680 --> 00:07:23,120
can work with. Because now all the data lives in a space where hopefully distance between data points

76
00:07:23,120 --> 00:07:30,160
is a meaningful thing. So that means you have geometry of some kind and you want to explore that.

77
00:07:30,160 --> 00:07:35,600
So there's a few things you can do. You can try and pull out dense regions, groups, clusters,

78
00:07:35,600 --> 00:07:42,960
and you can try and visualize that data. Now, usually if you're using any modern vectorizing

79
00:07:42,960 --> 00:07:47,920
technique, it's going to give you very high dimensional data. It'll be anywhere from a few

80
00:07:47,920 --> 00:07:53,920
hundred to several thousand dimensions. That's really hard to work with. For starters, most

81
00:07:53,920 --> 00:07:59,680
clustering algorithms aren't actually built around that kind of data set. Most clustering

82
00:07:59,680 --> 00:08:04,640
algorithms, a lot of the assumptions that are actually quietly baked in behind the scenes,

83
00:08:04,640 --> 00:08:13,600
assume that the data is pretty low dimensional, like anywhere from two to maybe 50, really.

84
00:08:13,600 --> 00:08:19,200
So your first problem is you have to manage to get the data to a state where you can cluster it

85
00:08:19,200 --> 00:08:23,360
reasonably and then hopefully one of those clustering algorithms that are out there will

86
00:08:23,360 --> 00:08:30,480
do the job. So I started out by building clustering algorithms because that's what I was interested

87
00:08:30,480 --> 00:08:37,840
in and then pivoted to how do I get the data to a state where I can cluster it. So I worked in

88
00:08:37,840 --> 00:08:45,280
dimension reduction after clustering and that led to visualizable representations of the data, but

89
00:08:46,400 --> 00:08:54,720
then you actually have to make that work for people. And so DataMapPlot is my latest attempt

90
00:08:54,720 --> 00:09:02,320
there. So let's see, what are some of the libraries that go on there? HDBSCAN for clustering. So

91
00:09:02,320 --> 00:09:08,000
that's density based clustering, but we want to be able to handle variable densities and do it

92
00:09:08,000 --> 00:09:15,520
as quickly as possible. UMAP is a method for dimension reduction and visualization. That'll

93
00:09:15,520 --> 00:09:22,240
get you either down to a clusterable number of dimensions, maybe five or ten, or down to two

94
00:09:22,240 --> 00:09:28,320
or three dimensions so you can visualize the data. There's also DataMapPlot; the goal of that

95
00:09:28,320 --> 00:09:34,400
one is, oh, you've produced a two-dimensional representation of your data via UMAP, t-SNE,

96
00:09:34,400 --> 00:09:41,360
it doesn't really matter. You can hand it to this program and it will make you nice visually

97
00:09:41,360 --> 00:09:48,320
appealing static plots or interactive plots that you can explore and play with from there.

98
00:09:48,320 --> 00:09:56,800
So those are a few of the libraries. Cool. So we'll start with HDBSCAN. So you started working on

99
00:09:56,800 --> 00:10:03,920
that over a decade ago at this point, right? Yeah, yeah about that. What were you using for

100
00:10:03,920 --> 00:10:10,320
dimensionality reduction at the time? Different ones? I wasn't, I was working with lower dimensional

101
00:10:10,320 --> 00:10:15,760
data. So it was once the dimensions started going up, then you realized, okay, then you had to get

102
00:10:15,760 --> 00:10:20,960
it. Yeah, yeah, yeah. Then I had to pivot. Well, one, yeah, as you get to other data types, you

103
00:10:20,960 --> 00:10:25,920
suddenly realize, oh, this thing I was using is not working anymore. What's not working about it?

104
00:10:26,560 --> 00:10:35,120
So yeah. Right. So with your implementation of HDBSCAN, I mean, there's so much to it, it's such a

105
00:10:35,120 --> 00:10:41,200
fabulous algorithm in and of itself, this extension of DBSCAN where you can kind of control some more

106
00:10:41,200 --> 00:10:49,040
of these things. And yours is fast also. How did you get it? I mean, in the most, I guess, lay terms,

107
00:10:49,040 --> 00:10:54,240
how did you get it to become a much faster algorithm? Well, I mean, the algorithm was written by some

108
00:10:54,240 --> 00:11:00,320
other people who published great papers on it, but it was a slower algorithm. So what it came down to

109
00:11:00,320 --> 00:11:06,080
was I really read their paper and then wrote a very basic implementation that was quite slow,

110
00:11:06,080 --> 00:11:10,960
and was like, this does a better job of clustering than anything else. And then I just read some

111
00:11:10,960 --> 00:11:15,600
other papers, other random papers, and was like, oh, well, if I take this paper and this paper

112
00:11:15,600 --> 00:11:22,320
and just push them together, then it'll go faster. And so it was really about the question

113
00:11:22,320 --> 00:11:28,400
of nearest neighbor search, because that was fundamental to what was going on internally

114
00:11:28,400 --> 00:11:37,680
inside HDBSCAN. So how do you make that faster? And there are some algorithms to do that. And it

115
00:11:37,680 --> 00:11:45,040
turns out that you can adapt them to work within HDBSCAN and make the whole thing flow together

116
00:11:45,040 --> 00:11:51,680
consistently. All the bits and pieces are there. They're slightly more messy and complicated than

117
00:11:51,680 --> 00:11:56,880
one would like. They're not the easiest of algorithms. So gluing them together took a little

118
00:11:56,880 --> 00:12:02,160
work. But really, I'm just gluing together work by other people as far as I'm concerned. Yeah,

119
00:12:02,160 --> 00:12:07,040
that's sometimes the most like that's sometimes when the breakthroughs happen, when you take the

120
00:12:07,040 --> 00:12:12,080
best pieces of them, or you can apply something from some other type of technique to

121
00:12:12,080 --> 00:12:18,320
the work that you're doing. What was the speedup? It was like, what, n squared to n log n, or

122
00:12:18,320 --> 00:12:24,000
something? n log n. Yeah, it had to be looking up everything for every point before, and then using

123
00:12:24,000 --> 00:12:31,280
minimum spanning trees or yeah. So the problem is that it wanted to compare every pair of points

124
00:12:31,280 --> 00:12:40,640
that gives you your n squared. But you can use space trees, space partitioning trees to cut

125
00:12:40,640 --> 00:12:46,000
that down to n log n. Although that doesn't work in high dimensions, it turns out because space trees

126
00:12:46,000 --> 00:12:50,080
don't work very well in high dimensions. But that's okay because you just need to get down

127
00:12:50,640 --> 00:12:56,160
to low enough dimensions to make it work first. Yeah, so I guess that then brings us to UMAP.

128
00:12:56,160 --> 00:13:03,120
Really a breakthrough, I know, a major breakthrough in dimensionality reduction, being

129
00:13:03,120 --> 00:13:10,640
able to capture both, you know, local and global structure of the data. But yeah, so you have had

130
00:13:10,640 --> 00:13:15,200
this problem where now you're dealing with data that had I guess a higher number of features and

131
00:13:15,200 --> 00:13:21,040
you had to reduce the data. So yeah, tell me about your thought process in creating something like

132
00:13:21,040 --> 00:13:29,280
UMAP. Well, I mean, I think the real breakthrough actually was t-SNE that came out in 2009.

133
00:13:30,480 --> 00:13:40,080
That's Hinton. Yeah, Laurens van der Maaten and Hinton. And I think that was a real breakthrough

134
00:13:40,080 --> 00:13:46,960
because it really demonstrated the possibility of these kinds of algorithms in general.

135
00:13:46,960 --> 00:13:52,080
Well, nothing else in dimension reduction or manifold learning up until then had really been

136
00:13:52,080 --> 00:13:56,720
as effective, especially for like visualization. So getting down to really small numbers of

137
00:13:56,720 --> 00:14:03,680
dimensions and having still good representations that capture a bunch of useful information.

138
00:14:03,680 --> 00:14:10,720
So I think that's where it started. So I was very impressed with how effective that approach was.

139
00:14:10,720 --> 00:14:18,880
And so I just wanted to build a method that would work with the math and theory that I was

140
00:14:18,880 --> 00:14:25,280
familiar with. So for me, that was very much sort of, as I said, geometry of the data, algebraic

141
00:14:25,280 --> 00:14:31,840
topology was what I knew. So I was trying to build a theoretical basis for an algorithm out of that.

142
00:14:31,840 --> 00:14:37,760
Then it's just a matter of slotting various pieces together. Again, like I was a bit of a magpie and

143
00:14:37,760 --> 00:14:44,800
grabbed different papers from all over the place. So there was a paper by David Spivak on fuzzy

144
00:14:44,800 --> 00:14:50,640
simplicial sets, which I read and realized. So I actually read it because I was interested in it for

145
00:14:50,640 --> 00:14:58,960
HDBSCAN. And it had a lot of application there in some ways, because again, HDBSCAN, you can view

146
00:14:59,520 --> 00:15:04,800
from an algebraic topology lens or topological data analysis lens. But it really just opened my

147
00:15:04,800 --> 00:15:09,760
eyes to the possibilities of what one could do and how one could interpret things. And so just

148
00:15:09,760 --> 00:15:15,920
grabbing random bits and pieces of interesting math and algorithms and gluing them all together.

149
00:15:17,200 --> 00:15:24,400
Very cool. I've used, well, yeah, I mean, all of your libraries a lot. But UMAP in particular,

150
00:15:24,400 --> 00:15:32,560
you really see how using different parameters, you can get very wildly different results.

151
00:15:32,560 --> 00:15:41,280
Sometimes it's hard to know and evaluate if you're, how do you know if you're reducing dimensions

152
00:15:41,280 --> 00:15:48,960
correctly? Do you have any sense of that? I have some sense. But actually, this is a challenging

153
00:15:48,960 --> 00:15:54,320
problem. And I think the answer is that you don't, because it depends on what you're trying to do

154
00:15:54,320 --> 00:16:01,200
with it. What is it that you're trying to represent in low dimensions? I think this is a problem also

155
00:16:01,200 --> 00:16:07,680
for clustering that I find is a bit of an issue is people expect the clustering algorithm to produce

156
00:16:07,680 --> 00:16:15,120
like the true clusters, but there aren't any singular true clusters. It depends on what kind of

157
00:16:15,440 --> 00:16:21,600
things you want to get out. And so expecting there to be some magical true answer, and these

158
00:16:21,600 --> 00:16:26,080
other algorithms are approximating or how good do they compare to the true clusters is like,

159
00:16:26,080 --> 00:16:34,400
there isn't a single right answer. And so that's the disappointing thing with unsupervised learning.

160
00:16:34,400 --> 00:16:39,200
Supervised learning, you've got all these metrics that you can measure how well you're doing.

161
00:16:39,200 --> 00:16:45,040
Unsupervised learning, it's kind of just like stare at it and go like, well, this is doing

162
00:16:45,040 --> 00:16:53,600
what I need. That's good enough. Which is unsatisfying in some ways. But at the same time, if it's doing

163
00:16:53,600 --> 00:17:01,840
what you want, isn't that good enough? Yes. Yeah, it depends so much on what the use case is.

164
00:17:01,840 --> 00:17:10,640
Clustering is such an art. I think of it as equal art to science. How many clusters should there be?

165
00:17:10,640 --> 00:17:15,920
Just understanding that what should the structure of them be like, you know, having them in hierarchical

166
00:17:15,920 --> 00:17:21,680
nature, understanding what the group should be, understanding what point that group should belong

167
00:17:21,680 --> 00:17:27,200
to, you know, all of those things, depending on what the use case is going to be, what are you

168
00:17:27,200 --> 00:17:32,960
going to use it for? Is this to set you up for some supervised learning later? Is this just some

169
00:17:32,960 --> 00:17:40,320
exploratory data analysis? Is it just to figure out, you know, some idea of how you could group or

170
00:17:40,320 --> 00:17:45,440
think about your data set. Something that you said that I really liked is that while it might not

171
00:17:45,440 --> 00:17:50,560
give you the answers, it helps you ask better questions of your data, a lot of these tools.

172
00:17:50,560 --> 00:17:59,200
Yeah, yeah, and that's very much what I'm interested in. I think one of the biggest

173
00:17:59,200 --> 00:18:06,720
difficulties with unstructured data right now is that we have great tools for search and picking

174
00:18:06,720 --> 00:18:13,440
through, but that assumes you already know what you're looking for. And a lot of the time you

175
00:18:13,440 --> 00:18:18,080
don't know what you're looking for, especially if it's a brand new data set and you have no idea

176
00:18:18,080 --> 00:18:23,920
what's in it. So it's the tools that allow you to step back and see the big picture of the whole

177
00:18:23,920 --> 00:18:31,600
data set or bring it into focus in a different way that's bringing in that sort of coherent whole

178
00:18:31,600 --> 00:18:38,080
information rather than, you know, looking at the data through a straw or just search and return the

179
00:18:38,080 --> 00:18:43,520
most likely results to your search term that only answers the questions you already knew to ask.

180
00:18:43,520 --> 00:18:51,760
Right. So going back into UMAP, you were inspired by T-SNE, but so why was there a need for

181
00:18:51,760 --> 00:18:58,880
another dimensionality reduction algorithm? It's a good question. I mean, T-SNE was at the time

182
00:18:58,880 --> 00:19:07,680
on the slower side and it had a tendency to kind of smush all the clusters kind of together. It

183
00:19:07,680 --> 00:19:14,240
separated them into clumps, but all the clumps were just wherever they landed. And I was interested

184
00:19:14,240 --> 00:19:19,680
in something that could get a little bit more of that non-local structure, some representation of

185
00:19:19,680 --> 00:19:28,400
how the clumps relate to each other, and also just a lot faster. And with a theory basis that I

186
00:19:28,400 --> 00:19:34,320
understood because I wanted to extend it in various ways. So one of them was semi-supervised

187
00:19:34,320 --> 00:19:40,640
and supervised versions of dimension reduction, but it's provided a framework for me that I can

188
00:19:40,640 --> 00:19:50,160
hang a lot of different adjustments on, which is why I don't know if you've looked at the input

189
00:19:50,160 --> 00:19:56,240
hyperparameter set for UMAP, but it's kind of excessively large. It's because I just kept

190
00:19:56,960 --> 00:20:00,880
seeing good ideas and you know, I'm like, oh, I can add an option for that.

191
00:20:00,880 --> 00:20:07,280
Yeah. Well, we could talk about a couple of them. I mean, the ones that I think

192
00:20:08,160 --> 00:20:15,280
are my go-tos are obviously number of neighbors, min distance, number of components,

193
00:20:15,280 --> 00:20:20,240
depending on the visualization. Those are the three I play with the most, but I've dabbled in some

194
00:20:20,240 --> 00:20:23,440
other ones, any other good ones that you want to talk about?

195
00:20:23,440 --> 00:20:32,160
Yeah. So one actually is output metric. That's kind of fun. So there's a metric parameter where

196
00:20:32,160 --> 00:20:38,640
you determine how you're measuring distance between data points, but you can also determine how

197
00:20:38,640 --> 00:20:43,760
you want to measure distance between data points in the output space, in the embedding space.

198
00:20:44,960 --> 00:20:52,720
So if you know that your data has periodic structure, it loops around, you can embed onto

199
00:20:52,720 --> 00:20:58,640
a torus, not the plane, or onto a sphere. Or one of the interesting options is you can

200
00:20:58,640 --> 00:21:06,080
embed not as points, but as Gaussians with a covariance structure and measure distance between

201
00:21:06,080 --> 00:21:13,040
Gaussians. So then you get an embedding where points have some uncertainty about where they're

202
00:21:13,040 --> 00:21:18,720
going to land. Very interesting. Yeah. It's sort of going back to like what your use case is for

203
00:21:18,720 --> 00:21:22,560
what you're doing. If you understand what the use case is, then you can incorporate that into

204
00:21:22,560 --> 00:21:30,880
the parameter. Very cool. In terms of UMAP and creating a library like this, so you were kind

205
00:21:30,880 --> 00:21:40,880
of creating it to scratch your itch to solve your problem. But there's so many of the visualizations

206
00:21:40,880 --> 00:21:49,840
of data are using UMAP now. What are some of the most unexpected uses of UMAP that you've seen?

207
00:21:49,840 --> 00:21:57,040
So in 2020, I was very surprised to see it coming up in COVID research repeatedly, which

208
00:21:58,320 --> 00:22:05,440
as a mathematician, that was not something that I felt I would ever be able to contribute to.

209
00:22:06,640 --> 00:22:14,080
And yet at the same time, at the start of the pandemic, when everything was panic stations,

210
00:22:14,080 --> 00:22:19,120
it was really interesting to see that I had managed to do something that helped some people

211
00:22:19,120 --> 00:22:27,440
somehow on solving this problem that was inspiring. It gets used in art actually a bunch. The pictures

212
00:22:27,440 --> 00:22:34,640
behind me are by an artist, Refik Anadol, who uses UMAP among many other machine learning tools to

213
00:22:34,640 --> 00:22:40,560
help develop art. There are a bunch of other artists I've been in touch with who also make use of it

214
00:22:40,560 --> 00:22:48,560
in various ways. And like that's not a use case that I ever had in mind. Right. Yeah. Well, the

215
00:22:48,560 --> 00:22:56,320
outputs that you can get are quite beautiful. The interesting thing from my perspective is,

216
00:22:56,320 --> 00:23:04,720
so I was doing topic modeling, I guess before UMAP, I was using, I was doing it when LDA,

217
00:23:04,720 --> 00:23:14,560
LSA, and NMF, you know, were the state of topic modeling at the time. And now I've seen how your

218
00:23:14,560 --> 00:23:19,920
libraries have transformed this whole topic modeling space, a highly used library like

219
00:23:19,920 --> 00:23:25,920
BERTopic: the default for dimensionality reduction, a library that you're the maintainer

220
00:23:25,920 --> 00:23:32,400
and, you know, one of the creators of, is UMAP. And the default clustering algorithm is HDBSCAN,

221
00:23:32,400 --> 00:23:39,600
same thing for you. So I have, like, witnessed your work transforming the work that I do. Every

222
00:23:39,600 --> 00:23:45,440
data set that I take, the first thing I do is run it through some pipeline that

223
00:23:45,440 --> 00:23:51,520
involves your libraries. And now as of late, I've been using DataMapPlot, which is such a cool

224
00:23:51,520 --> 00:23:57,120
visualization library. There's so many cool things that are happening. Because when you try to get,

225
00:23:57,840 --> 00:24:01,520
you never know exactly what you're going to get, right? When you get an output,

226
00:24:02,560 --> 00:24:08,560
you could have 40 clusters, you could have 400 clusters, you could have like, you know,

227
00:24:08,560 --> 00:24:13,600
depending on what your data set is. And the nice thing is, you've created something that by

228
00:24:13,600 --> 00:24:20,320
default tries to give you a really nice output that you can just kind of just take it and you

229
00:24:20,320 --> 00:24:24,640
can read. Sometimes you have to do some minor tweaks. But a lot of the things are done in the

230
00:24:24,640 --> 00:24:30,560
back end for you. So you don't need to worry about like a lot of the spacing and yeah, a lot of that

231
00:24:30,560 --> 00:24:35,440
stuff. What was the motivation there? You just wanted a way to visualize some of

232
00:24:35,440 --> 00:24:43,520
these outputs? I think the real motivation was for some internal use cases. I've seen people

233
00:24:43,520 --> 00:24:48,240
using this and they were giving presentations at the end of a workshop or something like that.

234
00:24:48,240 --> 00:24:55,680
And they'd put up some nice UMAP plots. And they don't have a lot of time. It's a short workshop.

235
00:24:55,680 --> 00:25:00,320
And then they want to give a presentation on the work they did. So they just plot however they can.

236
00:25:00,320 --> 00:25:06,000
And you know, I would look at it and be like, ah, if I had the time I could get in there and I

237
00:25:06,000 --> 00:25:09,760
could, you know, tweak a bunch of things about that plot and make it look a whole lot better.

238
00:25:10,880 --> 00:25:17,040
And I saw that happen often enough that I realized it's not fair that, like, I have the time

239
00:25:17,040 --> 00:25:22,720
to sit down and tweak all the parameters on the plot. But these other people they don't have that

240
00:25:22,720 --> 00:25:29,520
experience of having done a lot of these kinds of plots and they don't have the time to fiddle

241
00:25:29,520 --> 00:25:36,000
with Matplotlib or something like that for, you know, a couple days to make the plot look just

242
00:25:36,000 --> 00:25:43,040
so. I should take all the stuff that I've learned and just try and shove it into a library that

243
00:25:43,040 --> 00:25:48,640
allows them to get a pretty looking plot that does all the things that I would want to tweak for them

244
00:25:49,280 --> 00:25:52,640
out of the box and then hopefully give them enough knobs that they can

245
00:25:53,520 --> 00:25:58,080
make it look like what they want at the end of the day as well.

246
00:25:58,080 --> 00:26:04,320
Yeah, it's such a great tool. I mean, in so much of the data science work that we do,

247
00:26:04,320 --> 00:26:08,240
like, you know, there's the technical, the heavy aspects, there's the coding, there's the

248
00:26:08,240 --> 00:26:13,360
understanding of the data, there's the data engineering. But almost equally, you have to be

249
00:26:13,360 --> 00:26:18,240
able to share your results, right? And you have to be able to kind of let other people in from all

250
00:26:18,240 --> 00:26:23,280
different levels, executive level, different technical level, anyone, and those types of

251
00:26:23,280 --> 00:26:28,000
visualizations. That's what I get most excited about is that when you show someone like that,

252
00:26:28,000 --> 00:26:33,200
anyone can kind of relate and anyone can quickly understand. So just yeah, like the power of

253
00:26:33,200 --> 00:26:38,480
visualizations, if you wanted to like dial in on that how a visualization and the ability to kind

254
00:26:38,480 --> 00:26:43,760
of share your work and get other people involved in it, how that kind of helps you with your work.

255
00:26:43,760 --> 00:26:50,080
Seeing your data is just such a wonderful experience, because we're visual creatures. Our vision

256
00:26:50,080 --> 00:26:57,600
is the primary means of getting input into our brains. If you can turn data that's this

257
00:26:57,600 --> 00:27:04,160
giant mess into something that somebody can see, that's an enlightening experience. And I've had,

258
00:27:04,160 --> 00:27:11,360
you know, plenty of cases where I've worked with, you know, a domain expert on a data set that they care

259
00:27:11,360 --> 00:27:18,160
about that I realize I know nothing about. But we can work through, get it to a plot, stick it in

260
00:27:18,160 --> 00:27:23,920
front of them. And their first question is almost always, well, what's that? And why is it

261
00:27:23,920 --> 00:27:28,880
there? Because there's something new in the data that they didn't realize was there. And then we

262
00:27:28,880 --> 00:27:34,640
have to dig in to those questions. And so that's where I guess also the interactive plotting

263
00:27:34,640 --> 00:27:40,240
really starts to provide a lot more value because just seeing the plot is good.

264
00:27:41,680 --> 00:27:46,480
But now I want to answer the questions. So how do I do that bit next?

265
00:27:47,120 --> 00:27:51,120
Yeah, that's a good point that you brought up because like, at first, I was just thinking about,

266
00:27:51,120 --> 00:27:55,600
you know, like, oh, you're exploring the data, but why are you exploring the data? Right? Like,

267
00:27:56,160 --> 00:27:59,600
yes, you want to figure out good groups. But yeah, another thing that I always find

268
00:27:59,600 --> 00:28:05,200
whenever I create that output, you always figure out where those outliers are, like those weird

269
00:28:05,200 --> 00:28:11,520
artifacts of the data that somehow got in there. And then you can kind of figure out a strategy

270
00:28:11,520 --> 00:28:15,360
of how to deal with that. And then what I'll do is I'll kind of just, like, do like a

271
00:28:15,360 --> 00:28:21,520
loop of that. Right? Like, I use clustering to figure out what data points maybe I should pull out.

272
00:28:21,520 --> 00:28:25,120
And then I find UMAP works better, then I find, like, HDBSCAN works better. And like,

273
00:28:25,120 --> 00:28:29,440
I just kind of, like, keep doing this iterative process, which ends up giving

274
00:28:29,440 --> 00:28:35,840
you some really nice results. It's amazing, like, how outliers can affect

275
00:28:35,840 --> 00:28:42,240
like everything. But that's why HDBSCAN handles noise. That's one of the things that makes

276
00:28:42,240 --> 00:28:47,760
that a very robust algorithm. Yeah, your work is just... I'm just happy that we're

277
00:28:47,760 --> 00:28:54,640
talking because I remember I posted something. I used one of your visualizations in one of

278
00:28:54,640 --> 00:28:59,600
my projects. And I was looking forward to chatting with you. And then we bumped into

279
00:28:59,600 --> 00:29:04,560
each other in New York, when you were on that panel with Hugging Face, because yeah, the work that

280
00:29:04,560 --> 00:29:12,720
you're doing, it makes the work for so many other people so much easier. So much easier.

281
00:29:12,720 --> 00:29:20,560
So many of the things are abstracted away, where you can just kind of focus on your data sets,

282
00:29:20,560 --> 00:29:25,760
focus on understanding your data better, focus on asking better questions of your data. And I think

283
00:29:25,760 --> 00:29:33,440
it's probably because, going in, you weren't allowing it to be a black box,

284
00:29:33,440 --> 00:29:37,360
right? In one of the times that you spoke, you mentioned something about like,

285
00:29:38,400 --> 00:29:43,520
decomposing these black boxes, and then that kind of sets you up and you can understand

286
00:29:43,520 --> 00:29:49,680
how these things interact with each other. Can you talk to that? Yeah, I mean, so this is

287
00:29:49,680 --> 00:29:57,840
a thing that I definitely see. And so you mentioned BERTopic, and Maarten Grootendorst, who wrote that,

288
00:29:57,840 --> 00:30:06,560
has done a fabulous job with that of exposing the innards as Lego bricks that you can swap in. So

289
00:30:06,560 --> 00:30:14,320
yeah, the defaults are UMAP and HDBSCAN, but you can use an SVD and then K-means if you want,

290
00:30:14,320 --> 00:30:19,520
or you can swap all the different components out. And so seeing it as this

291
00:30:19,520 --> 00:30:27,600
composable piece of Lego bricks, I think is really valuable. But the same applies to even the innards

292
00:30:27,600 --> 00:30:35,360
of the other algorithms themselves. So personally, I see HDBSCAN as a pile of Lego bricks, there are

293
00:30:35,360 --> 00:30:40,960
a bunch of different things. So there's a density estimation step, there's a connectivity

294
00:30:40,960 --> 00:30:46,960
related step, there's building a tree of clusters step, and then there's a cluster extraction step.

295
00:30:46,960 --> 00:30:51,680
And these are all like, there are different algorithms for doing each of those things.

296
00:30:51,680 --> 00:30:57,760
HDBSCAN packages together a default set, but you can easily swap out each of those parts with

297
00:30:57,760 --> 00:31:02,720
something new. It's just a pile of Lego bricks. And the same with UMAP, there's constructing a

298
00:31:04,000 --> 00:31:10,560
representation of the high dimensional data in some sort of graph based way. If you want to think

299
00:31:10,560 --> 00:31:15,200
in terms of algebraic topology, it's a simplicial set. But there's lots of different ways

300
00:31:15,200 --> 00:31:21,040
of doing that. You could swap out that component. There's how you're going to optimize a low

301
00:31:21,040 --> 00:31:25,920
dimensional representation. Again, this is just components and pieces. If you look at something

302
00:31:25,920 --> 00:31:32,800
like LDA, which you mentioned earlier for topic modeling, one way of looking at that is,

303
00:31:32,800 --> 00:31:39,280
you know, this plate model in the probabilistic view. But you can also just view it as a matrix

304
00:31:39,280 --> 00:31:45,760
factorization algorithm and decompose it into the parts of how matrix factorization works,

305
00:31:45,760 --> 00:31:50,000
and give yourself, if not an understanding of everything there is to know about it,

306
00:31:50,000 --> 00:31:56,000
an understanding of the pieces and how they fit together in a way that would allow you to adapt it

307
00:31:56,000 --> 00:32:04,160
to different problems if you wanted to do that. So, you know, it works with categorical data,

308
00:32:04,160 --> 00:32:08,880
because you're looking at a distribution of words and the prior for that is a Dirichlet

309
00:32:08,880 --> 00:32:15,680
distribution. But if you had count data that was much more Poisson distributed, well, then you need

310
00:32:15,680 --> 00:32:22,560
a prior for the Poisson, there's gamma distributions for that, you could build an LDA like algorithm

311
00:32:22,560 --> 00:32:27,120
pretty easily that would work for an entirely different data type, if that's what you want to do,

312
00:32:27,120 --> 00:32:33,360
if you have broken it down into the pieces like that. I think more time spent understanding

313
00:32:33,360 --> 00:32:40,320
what the actual pieces of these algorithms are is pretty valuable if you ever want to adapt them

314
00:32:40,320 --> 00:32:44,880
or use them for something else. Right. That makes a lot of sense. Yeah, if you don't just say, oh,

315
00:32:44,880 --> 00:32:50,240
this is a black box, it's abstracted away, the only thing I have control over is just, you know,

316
00:32:50,240 --> 00:32:55,920
these parameters, well, you'll never really be able to fully grasp what's taking place. You won't

317
00:32:55,920 --> 00:33:01,200
be able to really build on top of the work in a way. So you have to decompose

318
00:33:01,200 --> 00:33:09,440
those things. That's probably my guess on why you have PyNNDescent, I would say. Yep. So, I mean,

319
00:33:09,440 --> 00:33:15,280
you talked about taking pieces to make tasks easier for other people. PyNNDescent is an

320
00:33:15,280 --> 00:33:23,840
example of me taking a piece that makes work easier for me. Because nearest neighbor search,

321
00:33:23,840 --> 00:33:29,600
I mentioned HDBSCAN needs to make use of some of those sorts of things. This also comes up a

322
00:33:29,600 --> 00:33:36,880
lot in UMAP, it comes up a lot in a lot of other places. I needed a thing that would do that in

323
00:33:36,880 --> 00:33:42,880
the ways that I needed it done. So, you know, again, I'm borrowing algorithms from other people. There

324
00:33:42,880 --> 00:33:49,680
was a great paper on an algorithm called nearest neighbor descent that builds approximate k nearest

325
00:33:49,680 --> 00:33:55,120
neighbor graphs very efficiently. And again, that decomposes into chunks, and you

326
00:33:55,120 --> 00:34:00,240
could pull apart the pieces, swap in some other pieces, change a few other pieces. And, you

327
00:34:00,240 --> 00:34:06,080
know, that's what I did with PyNNDescent to get the implementation that I have that solves

328
00:34:06,080 --> 00:34:12,480
the problems I have. So, I wanted things like working with a lot of different metrics. A lot of

329
00:34:12,480 --> 00:34:18,720
approximate nearest neighbor search will do Euclidean and cosine, and that's it. I wanted to be able

330
00:34:18,720 --> 00:34:24,080
to do anything. And I needed to be able to work with sparse data structures as well. So,

331
00:34:25,040 --> 00:34:31,920
again, it does that out of the box. These are, you know, problems that I needed solved, and it was

332
00:34:31,920 --> 00:34:39,120
best to just package it up. And now I get to reuse it all the time. So, I have a new clustering

333
00:34:39,120 --> 00:34:45,520
library called EVoC, for Embedding Vector Oriented Clustering. And it steals a whole bunch of stuff

334
00:34:45,520 --> 00:34:50,800
from PyNNDescent because it saves a lot of trouble to just reuse all of that work. Packaging

335
00:34:50,800 --> 00:34:56,000
them up so they can be reused. But at the same time, being able to decompose them back again into

336
00:34:56,000 --> 00:35:02,800
the parts that you need. So yeah, you've basically created a whole ecosystem of all of these parts

337
00:35:02,800 --> 00:35:07,760
that you can fit together to approach many different problems, but specifically like unsupervised

338
00:35:07,760 --> 00:35:16,320
learning in a very powerful, powerful way. In terms of the future of some of this work, or maybe one

339
00:35:16,320 --> 00:35:21,440
of the challenges in unsupervised learning or topic modeling in particular is something

340
00:35:21,440 --> 00:35:28,080
like trying to find new topics over time, right, trying to incorporate the temporal feature.

341
00:35:28,640 --> 00:35:31,280
Have you thought about that? Have you been thinking about that at all?

342
00:35:31,840 --> 00:35:37,680
Yeah, so temporal topic modeling is definitely something I've been giving a bit of thought to

343
00:35:37,680 --> 00:35:44,560
and how best to handle that. There are algorithms from topological data analysis, an algorithm called

344
00:35:44,560 --> 00:35:51,440
Mapper that's based on Morse theory that actually lends itself to these kinds of problems pretty well.

345
00:35:53,040 --> 00:35:58,320
And I actually just recently had the opportunity to work with a co-op student who was visiting the

346
00:35:58,320 --> 00:36:06,880
Institute. He's now off doing his PhD at Waterloo. He worked on a paper on an extension of Mapper that

347
00:36:06,880 --> 00:36:13,520
would be pretty much ideal to solve this kind of problem for specifically for the kinds of things

348
00:36:13,520 --> 00:36:20,880
you get from topic modeling over time. So hopefully that's on arXiv now, and if people

349
00:36:20,880 --> 00:36:28,720
want to go and read a highly theoretical paper, I would recommend checking out that. I don't have

350
00:36:28,720 --> 00:36:36,640
the link on me at the moment, but I can add it. These are fun and challenging problems. So

351
00:36:36,640 --> 00:36:44,960
adjusting over time is a big challenge. So again, another thing that would be great would be to have

352
00:36:44,960 --> 00:36:51,360
a UMAP embedding that can evolve over time as new data comes in. Right now, if you rerun UMAP,

353
00:36:52,000 --> 00:36:58,880
it gives you qualitatively pretty similar results, but it's invariant, the optimization

354
00:36:58,880 --> 00:37:04,720
problem's invariant under rotation and flipping. So at the very least, it could flip things around.

355
00:37:04,720 --> 00:37:11,360
And if you just use the transform method as it exists now, that is based on the data that we've

356
00:37:11,360 --> 00:37:18,000
already seen. It's just going to fit new data in as best it can, given this training set. So if a new

357
00:37:18,000 --> 00:37:24,480
cluster of data actually shows up, it's just going to squeeze it in amongst all the other data that's

358
00:37:24,480 --> 00:37:33,360
there. So there is some ongoing work to try and make a version of UMAP that would allow you to

359
00:37:33,360 --> 00:37:40,240
evolve in this way. So it sort of adapts to the new data that comes in. So if you get, let's say,

360
00:37:40,240 --> 00:37:45,360
you're looking at a month of data, then another week of data comes in. Instead of just trying to

361
00:37:45,360 --> 00:37:50,560
force that week into the last month, it would be adapting and changing. So if a new cluster came up,

362
00:37:50,560 --> 00:37:56,080
you could see that new cluster. Yeah, a new cluster would form and the rest of the data would have to

363
00:37:56,080 --> 00:38:02,000
move to fit around it and so on. Yeah, that's very much the goal. And I think that's definitely

364
00:38:02,000 --> 00:38:07,200
possible. So there's some great work being done by some other people on that that I'm

365
00:38:07,200 --> 00:38:13,920
desperately trying to follow and keep up with and will happily merge into UMAP main as soon as

366
00:38:13,920 --> 00:38:20,960
they get it in the state that they're happy with. Very exciting. Cool. So we talked a lot about

367
00:38:20,960 --> 00:38:29,200
unsupervised learning. I want to zoom out a little bit. And I'm going to ask the question around the

368
00:38:29,200 --> 00:38:36,880
hype of AI and machine learning. There's this, all of these promises are being made and coming

369
00:38:36,880 --> 00:38:41,360
from somebody who, you know, is creating these algorithms, seeing the things that we can do,

370
00:38:41,360 --> 00:38:46,960
maybe seeing some of the limitations. I'm curious what your view is on the gap between the hype

371
00:38:46,960 --> 00:38:51,760
and the reality that you see? Let's start with the reality. There's a lot of value in a lot of

372
00:38:51,760 --> 00:38:58,720
the stuff out there. It has enabled all sorts of things that are really just incredibly powerful

373
00:38:58,720 --> 00:39:08,240
and useful. But the hype is something else again. Like I am not a fan of the amount of hype around

374
00:39:08,240 --> 00:39:15,680
a lot of these things. I mean, for me personally as a user, a lot of the value in the generative

375
00:39:15,680 --> 00:39:23,200
large language models is just as a natural language interface. Right? Like, I mean, retrieval

376
00:39:23,200 --> 00:39:29,360
augmented generation is the thing. But really, that's old school information retrieval.

377
00:39:29,360 --> 00:39:37,200
A whole lot of hard work. And then you hand that to the LLM, which does the interfacing to the user

378
00:39:37,200 --> 00:39:41,760
of taking the question in natural language and then taking those information retrieved results

379
00:39:41,760 --> 00:39:48,800
and turning it into a nice natural language answer to the question. Now, there's a lot to be said

380
00:39:48,800 --> 00:39:55,520
for the value of providing a natural language interface to computers for users. So there's

381
00:39:55,520 --> 00:40:01,920
the value. That's a huge value proposition. But that's not what most of the hype about them is about.

382
00:40:02,960 --> 00:40:10,000
And at the same time, I think a lot of the like embedding approaches where you vectorize text,

383
00:40:10,000 --> 00:40:14,800
images, video, whatever, and just turn it into vectors, a lot of people seem to view that as a

384
00:40:14,800 --> 00:40:20,800
thing that you can then put in your retrieval augmented generation system. But it's so much

385
00:40:20,800 --> 00:40:26,000
more valuable than that. And there's so much more you can do with that. I think there's untapped

386
00:40:26,000 --> 00:40:32,320
value there still in all the various things. So I mean, topic modeling is one example. But

387
00:40:32,320 --> 00:40:37,200
I mean, you could easily convert that to topic modeling for images. I think BERTopic already

388
00:40:37,200 --> 00:40:44,880
supports a basic version of that. But you could turn that into all kinds of things very easily.

389
00:40:44,880 --> 00:40:51,760
Yeah. One cool injection of generative or the complement of generative to topic modeling

390
00:40:52,400 --> 00:40:57,920
was one of the toughest parts was you get your clusters in the end. And you just have to kind

391
00:40:57,920 --> 00:41:03,680
of figure out what the name is and how to define what that topic is and, you know, things

392
00:41:03,680 --> 00:41:09,040
like that. So that has been, for me, a really exciting use case of generative. I mean, maybe

393
00:41:09,040 --> 00:41:14,080
not everyone would find that exciting. But if you know the pain of trying to name all of your

394
00:41:14,080 --> 00:41:20,240
clusters, yes, if you can get anything to help you do it, that for me felt like a very good

395
00:41:20,240 --> 00:41:25,920
and valuable use case for generative like here, take, you know, with these keywords and take

396
00:41:25,920 --> 00:41:32,640
with these example documents, and now give me a good three word or less name for this for this

397
00:41:32,640 --> 00:41:39,200
cluster. It allows you to understand your data quickly or get a good sense of your data quickly,

398
00:41:39,200 --> 00:41:45,040
create potential classes. That's something that I found. But yeah, in terms of the hype, the hype

399
00:41:45,040 --> 00:41:48,800
of all of this couldn't be higher. It's supposed to solve all of our problems.

400
00:41:49,680 --> 00:41:55,840
RAG systems in general, you know, the idea, oh, I'll just vectorize everything. And I'll just

401
00:41:55,840 --> 00:42:01,920
find the most relevant documents. Like, then you haven't really used cosine similarity,

402
00:42:01,920 --> 00:42:07,520
because, like, information retrieval is really hard. And people have been

403
00:42:07,520 --> 00:42:10,960
working on it for a long time. And there's a reason why there's so many different

404
00:42:10,960 --> 00:42:15,440
algorithms. There's a reason why there's like whole businesses around it. And you have to take

405
00:42:15,440 --> 00:42:20,480
into account so many things, obviously, semantic patterns, lexical patterns, just like incorporating

406
00:42:20,480 --> 00:42:26,240
metadata. I found some, you know, you can call them RAG systems, but just being able to retrieve

407
00:42:26,240 --> 00:42:30,800
the necessary documents just by, like, kind of using other types of filtering, that can

408
00:42:30,800 --> 00:42:35,600
give you very, very good results. It is exciting to see what will happen. But I think

409
00:42:35,600 --> 00:42:40,480
that there's a lot of hype and new terminology being used for things that have been worked on for

410
00:42:40,480 --> 00:42:47,920
a long time. Yeah, for a long time. What do you see as a question that you believe remains unanswered

411
00:42:47,920 --> 00:42:53,920
currently in either machine learning or some of the work that you're doing? So I think there's a

412
00:42:53,920 --> 00:43:00,640
whole lot of scope for work to be done still in unsupervised learning. That's hugely biased

413
00:43:00,640 --> 00:43:04,480
because that's the field I work in. But I look at the state of things and I'm like,

414
00:43:04,480 --> 00:43:08,880
this could all be so much better. It's not like I have the answers for how to make it better. But

415
00:43:09,440 --> 00:43:16,560
I can definitely see lots of room and directions for improvement. I mean, even simple things like

416
00:43:16,560 --> 00:43:22,960
we have these sentence embedding models like SBERT and they're fantastic. But do you need the

417
00:43:22,960 --> 00:43:29,920
full power of a giant transformer based neural network to make that work? Because personally,

418
00:43:29,920 --> 00:43:34,240
I'm a fan of what's the simplest possible thing you can do that still does a good enough job.

419
00:43:35,040 --> 00:43:39,920
And I'd be really interested to know if you can produce sentence embeddings,

420
00:43:40,880 --> 00:43:46,880
like 98% as good as the transformer model with something that's just way simpler and easier

421
00:43:46,880 --> 00:43:52,560
to understand. Because the internal workings of what that pre-trained model has learned,

422
00:43:52,560 --> 00:44:02,160
that's harder to pick apart. That's a black box that's harder to decompose

423
00:44:04,240 --> 00:44:11,840
into pieces. I'd love a more decomposable version of say sentence embedding or vectorization or

424
00:44:11,840 --> 00:44:18,800
any of these sorts of things. Yeah. So some of the interpretability and understanding of

425
00:44:18,800 --> 00:44:25,200
those embedding models, there's a lot of very, you know, very interesting work. Tom

426
00:44:25,200 --> 00:44:30,400
Aarsen is doing an incredible job with Sentence Transformers and the ability to create custom

427
00:44:30,400 --> 00:44:35,600
embeddings and things like that. That's something that I'm very excited about. But yeah, like,

428
00:44:36,320 --> 00:44:39,680
I mean, to go into some of the things that we're doing, like, yeah, you create these embeddings,

429
00:44:39,680 --> 00:44:44,160
they're 512 dimensions, 768, 1024, like, they're huge sometimes. And it's like, well,

430
00:44:44,160 --> 00:44:48,560
do you really need it to be that big? Then you have to go through some sort of dimensionality

431
00:44:48,560 --> 00:44:56,960
reduction. I wonder if there's some combination of embeddings that could feed right into a

432
00:44:56,960 --> 00:45:02,880
clustering algorithm. That's something that I think would be cool. Yeah. Yeah. But now you need

433
00:45:02,880 --> 00:45:07,520
to decompose the parts and glue them together slightly differently. For like, if you have a

434
00:45:07,520 --> 00:45:13,600
specific task, there's definitely things you could do. How does that work? I don't know. But I

435
00:45:13,600 --> 00:45:18,400
think there'd be great answers if you could figure it out. Yeah. Well, there has to be some things

436
00:45:18,400 --> 00:45:23,840
that we don't have the answers to. So there's a reason to come and continue thinking and working

437
00:45:23,840 --> 00:45:28,480
on all this stuff. There's so many exciting things. All right, to zoom even further out,

438
00:45:28,480 --> 00:45:35,600
I'll ask an advice question. What advice would you give to someone that's just starting out in

439
00:45:35,600 --> 00:45:42,160
the field of either research, data science, machine learning? So specifically for data science and

440
00:45:42,160 --> 00:45:47,920
machine learning, my advice is don't follow the hype in the same direction that everyone else is

441
00:45:47,920 --> 00:45:54,960
going because you're not going to make a dent in the field that everyone is already working on.

442
00:45:55,840 --> 00:46:01,520
Go do whatever is interesting to you that isn't necessarily the hype thing and be good at something

443
00:46:01,520 --> 00:46:07,440
that you're good at. And that's probably going to be good enough. The time will come when whatever

444
00:46:07,440 --> 00:46:13,520
you're working on will come around. So and the other thing is that I think interdisciplinary

445
00:46:13,520 --> 00:46:19,200
spaces are where a lot of value comes. That doesn't mean you need to split yourself between like two

446
00:46:19,200 --> 00:46:27,440
wildly different subjects. But if you can find some time to stretch into subject areas that

447
00:46:27,440 --> 00:46:33,040
people aren't otherwise necessarily working on, that can make a big difference. I mean,

448
00:46:33,040 --> 00:46:38,080
I kept talking about being like a magpie and just grabbing different random bits and pieces to stick

449
00:46:38,080 --> 00:46:43,840
together. And that's because I touched on a few different fields. So you know, a bunch of pure

450
00:46:43,840 --> 00:46:49,280
math, but some machine learning things, algorithmic things, and just being able to have enough of a

451
00:46:49,280 --> 00:46:54,000
stretch to grab things from different fields that other people aren't putting together. That's often

452
00:46:54,000 --> 00:47:00,960
what you need to create something new. Yeah, yeah, I have to agree. And then just like a small

453
00:47:00,960 --> 00:47:08,160
extension of that, I guess the interdisciplinary part to it. I always found like when I was doing

454
00:47:08,160 --> 00:47:13,120
research, when I was, you know, in my engineering program, you're very like, you can be very insulated

455
00:47:13,120 --> 00:47:18,640
sometimes around people who are very like minded and who approach the problems in the same way.

456
00:47:18,640 --> 00:47:24,560
So sometimes like being exposed to other people that are just look at the world in a different way

457
00:47:24,560 --> 00:47:29,120
or think of the problem in a different way. Are there any people either in your work or in the

458
00:47:29,120 --> 00:47:33,600
fields that you found that have kind of like opened up the way that you think about things?

459
00:47:35,040 --> 00:47:44,240
There are many, many people. Let me see if I can think of a few that just spring to mind. So Matt

460
00:47:44,240 --> 00:47:50,960
Rocklin, who built Dask and runs Coiled, I don't know if you've ever had the opportunity to interact

461
00:47:50,960 --> 00:47:56,080
with Matt or watch him interacting with other people, but he is amazing at going and listening

462
00:47:56,080 --> 00:48:04,000
to people about whatever their problem is. And that is inspiring because that's how you actually

463
00:48:04,000 --> 00:48:10,400
find out what you need to build. Not necessarily sitting down yourself and coming up with the

464
00:48:10,960 --> 00:48:17,920
whatever you think is cool. Work out what the problems that your potential users are actually

465
00:48:17,920 --> 00:48:24,000
having actually are and listen to them. And Matt is amazing at that.

466
00:48:24,000 --> 00:48:32,160
Who else? Lorena Barba has done amazing work on reproducibility and also just education. So

467
00:48:33,600 --> 00:48:38,560
thinking about how to explain things well, especially with like interactive tooling or

468
00:48:38,560 --> 00:48:43,360
anything like that. She's done a fabulous job about just computational thinking in general.

469
00:48:45,120 --> 00:48:50,560
Vincent Warmerdam is awesome because he always wants to build the simplest thing that still

470
00:48:50,560 --> 00:48:56,640
works. And that is very much something that I definitely believe in. And he's just so great at

471
00:48:56,640 --> 00:49:02,560
always putting those together and explaining it all so well. Absolutely. And then for something

472
00:49:02,560 --> 00:49:10,880
completely different, Emily Riehl, who's a category theorist, she does the most amazing,

473
00:49:10,880 --> 00:49:16,800
incredibly deep, complicated pure math work and still writes like great textbooks. Her

474
00:49:16,800 --> 00:49:22,320
introduction to category theory is one that I would recommend for anyone. Category Theory in

475
00:49:22,320 --> 00:49:28,320
Context is just a great introduction to the subject and it's just so approachable. And she's

476
00:49:28,320 --> 00:49:34,720
just amazing in general. So there's four people off the top of my head. Okay, that's great. I'm

477
00:49:34,720 --> 00:49:41,120
going to look. Well, I know Vincent and I'll look into the other ones. All right, I think we're

478
00:49:41,120 --> 00:49:46,880
ready for the final or almost final question. So, well, yeah, you describe yourself, I think, what,

479
00:49:46,880 --> 00:49:51,440
as a mathematician turned data scientist, but I'll phrase the question:

480
00:49:52,080 --> 00:49:59,280
what has a career in research taught you about life? Well, one of the things I've learned

481
00:49:59,760 --> 00:50:05,840
talking to domain experts on their data is that there's a whole lot that I know almost nothing

482
00:50:05,840 --> 00:50:12,400
about. In fact, almost everything I know almost nothing about. And that's always worth keeping

483
00:50:12,400 --> 00:50:18,560
in mind is how many other different things there are out there and how little you know.

484
00:50:20,960 --> 00:50:30,400
So, research also taught me I definitely don't have all the answers and that's okay. You've got

485
00:50:30,400 --> 00:50:36,000
to learn to live with that. You don't have all the answers, but maybe you can get there and

486
00:50:36,000 --> 00:50:42,480
that you just need to be patient with problems. Sometimes these things just need to sit in the

487
00:50:42,480 --> 00:50:48,720
back of your brain for a long time and then eventually, I don't know what subconscious

488
00:50:48,720 --> 00:50:54,320
process works back there, but eventually answers do come out. So just be patient.

489
00:50:54,320 --> 00:51:03,120
I like that. Yeah. I think research has always led me not to answers, but to just more

490
00:51:03,120 --> 00:51:10,000
questions. And then I can definitely relate to the second one, sometimes problems,

491
00:51:10,880 --> 00:51:15,120
not that they solve themselves, but like once you stop thinking about what the solution is

492
00:51:15,120 --> 00:51:20,640
going to be, or you do something else, go exercise or like, you know, the shower ideas where you're

493
00:51:20,640 --> 00:51:26,240
not thinking about anything else, that's when you get some of your best ideas. That's great.

494
00:51:26,240 --> 00:51:33,280
That's really good advice. It's definitely how you could apply some of the thinking of research to

495
00:51:33,840 --> 00:51:39,200
dealing with the uncertainty of life and yeah, how little we know. Like how little we know,

496
00:51:39,920 --> 00:51:45,760
the only thing I know is that I know nothing. Leland, this has been such a pleasure.

497
00:51:45,760 --> 00:51:50,800
I've really enjoyed talking about all of these things. Thanks for going through

498
00:51:50,800 --> 00:51:56,160
topic modeling, all of your amazing libraries. If there are listeners out there that want to

499
00:51:56,160 --> 00:52:01,440
learn more about you or any of the work that you're doing, where would you direct them?

500
00:52:03,680 --> 00:52:09,920
Probably GitHub, I guess, has most of the projects. I try and make everything open

501
00:52:09,920 --> 00:52:16,560
source as much as possible. So lmcinnes on GitHub, but also the Tutte Institute

502
00:52:17,280 --> 00:52:24,640
GitHub page, you can find a bunch of our projects there as well. You can also find me,

503
00:52:24,640 --> 00:52:32,800
I think, on Twitter and Bluesky, search for my name. I guess I have a sufficiently novel name

504
00:52:32,800 --> 00:52:38,080
that hopefully you'll find me. And my email address is out there, so you can always reach out there

505
00:52:38,080 --> 00:52:43,760
or on LinkedIn or whatever if you want to get in touch. Very cool. Yeah, and for anyone that is

506
00:52:43,760 --> 00:52:49,520
not familiar with these libraries, you should definitely check them out. UMAP, HDBSCAN,

507
00:52:49,520 --> 00:52:54,880
and DataMapPlot, and I'm sure soon enough there'll be some more exciting ones or extensions to these.

508
00:52:56,800 --> 00:53:01,040
Leland, thank you so much for the work that you do. Thank you so much for the time

509
00:53:01,040 --> 00:53:09,040
and letting me pick your brain for a little while. Thank you. Appreciate it.

510
00:53:11,840 --> 00:53:16,160
On this episode of Learning from Machine Learning, I had the privilege of speaking with

511
00:53:16,160 --> 00:53:22,480
Leland McInnes, the creator of a suite of data science tools, including UMAP, HDBSCAN,

512
00:53:22,480 --> 00:53:27,920
and DataMapPlot. His work has significantly impacted the field, particularly in the realm

513
00:53:27,920 --> 00:53:33,760
of unsupervised learning. What sets Leland apart is his unique approach to data science problems,

514
00:53:34,400 --> 00:53:39,520
deeply rooted in his background in pure mathematics, particularly algebraic topology.

515
00:53:40,160 --> 00:53:46,240
He views data through a geometric lens, seeking to understand its underlying structure. This led

516
00:53:46,240 --> 00:53:51,840
him to develop UMAP, an essential dimensionality reduction technique that excels at capturing

517
00:53:51,840 --> 00:53:58,800
both local and global structure of datasets. We also discussed HDBSCAN, a robust clustering

518
00:53:58,800 --> 00:54:04,160
algorithm known for its ability to handle noise and variable densities within datasets,

519
00:54:04,160 --> 00:54:10,240
making it highly effective for real-world applications. Beyond his technical contributions,

520
00:54:10,240 --> 00:54:15,840
Leland shared valuable insights for aspiring data scientists and researchers. He stressed the

521
00:54:15,840 --> 00:54:21,760
importance of not just blindly following hype, but instead pursuing passions. He discussed the

522
00:54:21,760 --> 00:54:26,640
importance of embracing interdisciplinary thinking, and how many of his breakthroughs came from

523
00:54:26,640 --> 00:54:32,160
connections between seemingly disparate areas of study. Leland highlighted the importance of

524
00:54:32,160 --> 00:54:37,840
decomposing black boxes, encouraging a deeper understanding of how algorithms work, rather

525
00:54:37,840 --> 00:54:43,520
than treating them as impenetrable mysteries. By breaking down complex algorithms into their

526
00:54:43,520 --> 00:54:49,200
fundamental components, data scientists gain knowledge and flexibility to adapt them to new

527
00:54:49,200 --> 00:54:55,280
problems and data types. This approach promotes transparency and empowers data scientists to

528
00:54:55,280 --> 00:55:02,480
be more than just algorithm users. They become algorithm creators and innovators. Leland's journey

529
00:55:02,480 --> 00:55:07,120
underscores the power of curiosity, the importance of interdisciplinary thinking,

530
00:55:07,120 --> 00:55:12,800
and the value of understanding the tools we use. By embracing these principles, we can unlock the

531
00:55:12,800 --> 00:55:17,280
true potential of data science and continue to push the boundaries of what's possible.

532
00:55:17,280 --> 00:55:22,480
Thank you for listening, and be sure to subscribe and share with a friend or

533
00:55:22,480 --> 00:55:48,080
colleague. Until next time, keep on learning.