1
00:00:00,000 --> 00:00:07,760
Welcome to cannabis data science.

2
00:00:07,760 --> 00:00:13,520
As I was saying, we're going to be advancing cannabis science in a major way today.

3
00:00:13,520 --> 00:00:25,400
So this is something that has been in my brain now for, let's say, six or seven years.

4
00:00:25,400 --> 00:00:26,960
I've been kind of formalizing.

5
00:00:26,960 --> 00:00:34,200
Okay, how do you mathematically model cannabis preferences?

6
00:00:34,200 --> 00:00:41,080
And it's something that I chewed on for a while, maybe put on the back burner.

7
00:00:41,080 --> 00:00:43,840
I like to have multiple pies in the oven.

8
00:00:43,840 --> 00:00:45,880
And now this has come up again.

9
00:00:45,880 --> 00:00:49,120
People are interested in effects.

10
00:00:49,120 --> 00:00:52,320
What drives consumers?

11
00:00:52,320 --> 00:00:55,440
Consumers want to better understand themselves.

12
00:00:55,440 --> 00:00:58,160
Consumers want to understand what is this all about?

13
00:00:58,160 --> 00:01:01,400
So this is sort of of utmost interest.

14
00:01:01,400 --> 00:01:05,900
And I'll try to provide you with some new material today because I know we've been looking

15
00:01:05,900 --> 00:01:15,040
at the same data set a lot, but I'll show you today where we may have made some missteps

16
00:01:15,040 --> 00:01:21,320
in the past, how we can correct ourselves in the future and keep making the world a

17
00:01:21,320 --> 00:01:22,320
better place.

18
00:01:22,320 --> 00:01:30,760
Anywho, for those of you who are new, my name is Keegan, started Cannlytics, just doing

19
00:01:30,760 --> 00:01:36,680
cannabis analytics here in the cannabis space, helping out everyone we can.

20
00:01:36,680 --> 00:01:38,680
Each and every one we can.

21
00:01:38,680 --> 00:01:41,520
So Rishi, how about you?

22
00:01:41,520 --> 00:01:43,040
What brings you to the group?

23
00:01:43,040 --> 00:01:47,240
Any questions you would like to answer with data science?

24
00:01:47,240 --> 00:01:48,960
Nothing really.

25
00:01:48,960 --> 00:01:54,520
I was just curious, you know, the whole topic in itself sounded very interesting, data science

26
00:01:54,520 --> 00:01:55,520
and cannabis.

27
00:01:55,520 --> 00:01:57,360
I'm not a user.

28
00:01:57,360 --> 00:01:58,360
I'm not a provider.

29
00:01:58,360 --> 00:02:04,080
I'm just here to, you know, sort of understand the transformations in the world.

30
00:02:04,080 --> 00:02:08,360
So it's just a learning experience for me.

31
00:02:08,360 --> 00:02:09,680
Well spot on Rishi.

32
00:02:09,680 --> 00:02:14,040
We attract a lot of good data scientists like yourself.

33
00:02:14,040 --> 00:02:18,080
So it's fantastic to have you part of the team.

34
00:02:18,080 --> 00:02:23,200
As you'll see today, we have a novel take on things.

35
00:02:23,200 --> 00:02:28,160
Sometimes we use tools differently or use different tools than other people.

36
00:02:28,160 --> 00:02:33,000
But I think you'll be interested in the work we're doing today and you'll probably even

37
00:02:33,000 --> 00:02:36,680
be able to help contribute.

38
00:02:36,680 --> 00:02:38,720
It's sort of an all-hands-on-deck moment.

39
00:02:38,720 --> 00:02:45,840
I'll show you today where you can contribute.

40
00:02:45,840 --> 00:02:52,200
Because as I said, it's actually just become so abundantly clear how much value we're generating

41
00:02:52,200 --> 00:02:57,320
that I think we're kind of attracting people's attention.

42
00:02:57,320 --> 00:03:02,680
Aka you may be able to get compensated for your efforts here.

43
00:03:02,680 --> 00:03:11,480
So anywho, Charles, I was speaking with you about training data and that will actually

44
00:03:11,480 --> 00:03:12,960
come up today.

45
00:03:12,960 --> 00:03:15,640
So what's on your mind?

46
00:03:15,640 --> 00:03:18,520
I know you've been doing some good forays.

47
00:03:18,520 --> 00:03:24,440
So you're welcome to summarize some of your research or share anything on your mind.

48
00:03:24,440 --> 00:03:33,960
So I've been working with this strain data and I tried to show you can compress the data

49
00:03:33,960 --> 00:03:38,200
by averaging things out.

50
00:03:38,200 --> 00:03:42,680
I showed how you could keep improving on the model.

51
00:03:42,680 --> 00:03:56,680
Using a dummy classifier which just predicts the distribution of the training data and

52
00:03:56,680 --> 00:03:59,000
then how you could advance that.
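
The baseline Charles describes, a classifier that only learns the label distribution of the training data, is a one-liner in scikit-learn. Here is a minimal sketch on synthetic data; the features and the indica/sativa labels are invented for illustration and are not the strain data set:

```python
# Baseline: a "dummy" classifier that only learns the label distribution
# of the training data, then a real model to advance on it.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: terpene-like features and a 0/1 indica/sativa label.
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "stratified" samples predictions from the training label distribution.
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X_train, y_train)

model = LogisticRegression()
model.fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))
print("model accuracy:   ", model.score(X_test, y_test))
```

The point of the dummy baseline is that a real model should beat it before anything more complicated is attempted.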

53
00:03:59,000 --> 00:04:06,200
And then I also have another model that I'm working on which doesn't use the compressed

54
00:04:06,200 --> 00:04:07,200
data.

55
00:04:07,200 --> 00:04:14,600
That actually gets much better results.

56
00:04:14,600 --> 00:04:19,440
So I will publish that within the next day or two.

57
00:04:19,440 --> 00:04:25,080
And I'm just sort of starting out with the most simple binary classifier and trying to move

58
00:04:25,080 --> 00:04:27,800
up and see how far you can go with the data.

59
00:04:27,800 --> 00:04:31,000
The data is actually a very limited data set.

60
00:04:31,000 --> 00:04:36,120
So there's not a lot there for a model to learn.

61
00:04:36,120 --> 00:04:42,160
And so yeah, this is more what a data scientist would do.

62
00:04:42,160 --> 00:04:46,360
And I think people who come here to learn data science, these are the kind of skills

63
00:04:46,360 --> 00:04:51,980
that they need to understand, because it is a tough job market.

64
00:04:51,980 --> 00:04:55,700
Everybody talks about how many jobs there are, but it's a really difficult market to

65
00:04:55,700 --> 00:04:56,700
break into.

66
00:04:56,700 --> 00:04:58,240
And these are the skills people are looking for.

67
00:04:58,240 --> 00:05:02,560
And this is kind of what's expected of a typical data scientist.

68
00:05:02,560 --> 00:05:08,360
So I'm trying to demonstrate that and trying to help out, give people better techniques

69
00:05:08,360 --> 00:05:11,840
as to how to analyze data and how to make predictions.

70
00:05:11,840 --> 00:05:17,080
Charles, what kind of model are you trying to build?

71
00:05:17,080 --> 00:05:25,160
I'm trying to just determine if it's an Indica or a Sativa.

72
00:05:25,160 --> 00:05:28,520
And it's just a binary classifier.

73
00:05:28,520 --> 00:05:35,600
And so if you can't get a model to predict the most simple thing, then doing something

74
00:05:35,600 --> 00:05:42,280
more complicated like trying to predict the effects isn't going to happen.

75
00:05:42,280 --> 00:05:45,780
You have to have enough data to be able to learn something simple.

76
00:05:45,780 --> 00:05:47,120
And so that's what I'm trying to do.

77
00:05:47,120 --> 00:05:48,120
I'm trying to build up.

78
00:05:48,120 --> 00:05:51,120
You and I could have a talk.

79
00:05:51,120 --> 00:05:54,120
Yeah, that'd be great.

80
00:05:54,120 --> 00:05:59,920
I'm just wondering if you're considering hybrid as well?

81
00:05:59,920 --> 00:06:03,040
Is that part of your calculation?

82
00:06:03,040 --> 00:06:07,160
No, one of the things John had mentioned is that it's really hard to predict if it's

83
00:06:07,160 --> 00:06:08,160
hybrid or not.

84
00:06:08,160 --> 00:06:09,160
Yeah.

85
00:06:09,160 --> 00:06:10,160
So he's the expert.

86
00:06:10,160 --> 00:06:12,160
Is it legal or not?

87
00:06:12,160 --> 00:06:13,160
Well, maybe.

88
00:06:13,160 --> 00:06:14,160
Yeah.

89
00:06:14,160 --> 00:06:21,720
But anyway, a binary classifier is easier to do than to learn three things.

90
00:06:21,720 --> 00:06:25,920
So if I can get a binary classifier that does really well, I could try predicting hybrid.

91
00:06:25,920 --> 00:06:27,400
But I'm kind of...

92
00:06:27,400 --> 00:06:29,320
Well, I don't even know what hybrid means.

93
00:06:29,320 --> 00:06:35,440
I mean, it seems to be like a scale that can be infinite.

94
00:06:35,440 --> 00:06:38,320
And where do you draw the line?

95
00:06:38,320 --> 00:06:39,900
Yeah.

96
00:06:39,900 --> 00:06:43,880
So that's kind of why I went with this binary classifier.

97
00:06:43,880 --> 00:06:47,200
You know, again, you start with the simplest thing and work your way up.

98
00:06:47,200 --> 00:06:50,880
You don't start at the hardest part and work your way down, right?

99
00:06:50,880 --> 00:06:56,080
You just start at the most basic type of model and work your way up.

100
00:06:56,080 --> 00:06:57,560
And it's very rational.

101
00:06:57,560 --> 00:07:02,520
I wish I could learn to do that.

102
00:07:02,520 --> 00:07:08,880
So I have some slides to show that are relevant to this.

103
00:07:08,880 --> 00:07:12,280
I'm wondering if this is a good opportunity.

104
00:07:12,280 --> 00:07:21,760
Well, actually, sure, why don't you pull up your slides?

105
00:07:21,760 --> 00:07:25,600
Just quickly let everyone just say hey real quick in the next two minutes.

106
00:07:25,600 --> 00:07:28,400
Then you can present your slides while we're on this topic.

107
00:07:28,400 --> 00:07:30,240
But just a...

108
00:07:30,240 --> 00:07:34,880
So you're welcome to pull those up, John.

109
00:07:34,880 --> 00:07:35,880
All right.

110
00:07:35,880 --> 00:07:37,760
I'll see what I can do here.

111
00:07:37,760 --> 00:07:38,760
Amos.

112
00:07:38,760 --> 00:07:48,840
Anything you want to say on this topic along the skills you need to be a top tier data

113
00:07:48,840 --> 00:07:53,160
scientist here in the cannabis space?

114
00:07:53,160 --> 00:08:02,480
Candice, anything on your mind about how we can hone our skills?

115
00:08:02,480 --> 00:08:03,480
Not really.

116
00:08:03,480 --> 00:08:04,480
Not at this time.

117
00:08:04,480 --> 00:08:07,000
But, you know, that's the thing too.

118
00:08:07,000 --> 00:08:11,320
You know, garbage in, garbage out, right?

119
00:08:11,320 --> 00:08:15,480
So it is good to start out with good quantified data.

120
00:08:15,480 --> 00:08:17,960
Spot on.

121
00:08:17,960 --> 00:08:23,360
And also welcome to the group, Arthur Gill.

122
00:08:23,360 --> 00:08:25,560
Please correct my pronunciation.

123
00:08:25,560 --> 00:08:28,960
But welcome to the group.

124
00:08:28,960 --> 00:08:37,040
We're about to be talking all about, you know, cannabis classification and effects different

125
00:08:37,040 --> 00:08:39,480
people may have experienced.

126
00:08:39,480 --> 00:08:43,200
And of course, data science models.

127
00:08:43,200 --> 00:08:46,640
So John's going to show you a model.

128
00:08:46,640 --> 00:08:49,400
Charles may talk about a model he's using.

129
00:08:49,400 --> 00:08:56,000
And then if we've got time, I may spend 15 minutes at the end and show you yet another

130
00:08:56,000 --> 00:08:57,000
model.

131
00:08:57,000 --> 00:08:59,040
So it's real cool, right?

132
00:08:59,040 --> 00:09:03,560
Because, and you're all welcome to share any of your work here as well.

133
00:09:03,560 --> 00:09:07,680
Because basically we're just bringing the greatest minds together here in the cannabis

134
00:09:07,680 --> 00:09:08,680
space.

135
00:09:08,680 --> 00:09:14,040
And we're all just sharing tools, methods, our latest greatest research.

136
00:09:14,040 --> 00:09:16,200
It's real fun.

137
00:09:16,200 --> 00:09:17,200
So anywho.

138
00:09:17,200 --> 00:09:22,400
Yeah, John, can you turn off the hide stuff sharing?

139
00:09:22,400 --> 00:09:24,200
Oh, yes.

140
00:09:24,200 --> 00:09:28,480
Well, feel free to speak up Arthur Gill.

141
00:09:28,480 --> 00:09:38,040
And then otherwise, John, you're welcome to present your latest research here.

142
00:09:38,040 --> 00:09:40,240
So I'm going to go pretty fast.

143
00:09:40,240 --> 00:09:46,400
And Charles, I really look forward to maybe having a discussion offline so that maybe we

144
00:09:46,400 --> 00:09:51,640
can kind of put our heads together and refine some of this.

145
00:09:51,640 --> 00:09:59,440
There are two top level classifications that certainly I think the field has established

146
00:09:59,440 --> 00:10:02,720
and our group is certainly interested in.

147
00:10:02,720 --> 00:10:06,200
The first one is classifying the cannabis at the cannabinoid level.

148
00:10:06,200 --> 00:10:11,720
And historically cannabis has been divided into groups.

149
00:10:11,720 --> 00:10:19,960
In fact, historically, the sativa-indica distinction lasted for about two centuries, from the taxonomists

150
00:10:19,960 --> 00:10:24,720
in the seventeen and eighteen hundreds, as this plant was being

151
00:10:24,720 --> 00:10:26,440
described.

152
00:10:26,440 --> 00:10:30,000
And it referred to the cannabinoid type CBD or THC.

153
00:10:30,000 --> 00:10:34,440
Of course, they didn't know that; to them it was the medicinal or fiber type.

154
00:10:34,440 --> 00:10:42,840
And the sativa designation was given to the fiber, the hemp type, and the indica designation

155
00:10:42,840 --> 00:10:47,280
was given to the intoxicating or THC type.

156
00:10:47,280 --> 00:10:51,960
And that lasted for about two centuries.

157
00:10:51,960 --> 00:11:00,440
What was discovered, in nineteen seventy three, was the first time

158
00:11:00,440 --> 00:11:07,240
that the type one, two, and three designation was applied to cannabis rather than sativa and indica.

159
00:11:07,240 --> 00:11:11,760
And Jerry, I know you asked the other week about what is type one, two and three.

160
00:11:11,760 --> 00:11:17,660
So I wanted to just say this is an easy way to visualize what it is.

161
00:11:17,660 --> 00:11:24,280
You take the CBD concentration, sorry, on the Y

162
00:11:24,280 --> 00:11:31,560
axis, and you plot it as a log, and you plot the THC concentration on the X axis for a

163
00:11:31,560 --> 00:11:34,240
given cultivar, again on a log scale.

164
00:11:34,240 --> 00:11:40,720
And you get these three really nice groupings and you can make assessments about what the

165
00:11:40,720 --> 00:11:42,120
ranges of these are.

166
00:11:42,120 --> 00:11:46,840
You can throw darts into the middle of these clusters and assign values kind of like I

167
00:11:46,840 --> 00:11:48,040
did here.
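
The view John describes, log CBD against log THC with three tight groupings, can be sketched quickly. In this hedged example the THC and CBD concentrations are synthetic placeholders, not lab values, and k-means stands in for the by-eye cluster assignment:

```python
# Sketch of the type I/II/III view: log THC vs. log CBD,
# with three clusters found by k-means. Concentrations are synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Illustrative chemotypes (percent by mass, made up for the sketch):
type1 = np.column_stack([rng.uniform(15, 25, 50), rng.uniform(0.03, 0.3, 50)])  # high THC
type2 = np.column_stack([rng.uniform(4, 10, 50), rng.uniform(4, 10, 50)])       # roughly 1:1
type3 = np.column_stack([rng.uniform(0.1, 0.6, 50), rng.uniform(8, 16, 50)])    # high CBD

# Columns: log THC (x axis), log CBD (y axis).
points = np.log(np.vstack([type1, type2, type3]))

labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(points)
print("clusters found:", len(set(labels)))
```

Plotting `points` with a color per label reproduces the three groupings described in the talk.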

168
00:11:48,040 --> 00:11:50,640
So we've been using this for quite a while.

169
00:11:50,640 --> 00:11:57,760
This is, I mean, this has really stood the test of time, whether it's Small and Beckstead

170
00:11:57,760 --> 00:12:04,120
in nineteen seventy three, Hillig and Mahlberg in 2004, and then our own work that we started

171
00:12:04,120 --> 00:12:07,760
to disseminate on this in twenty fourteen.

172
00:12:07,760 --> 00:12:15,120
So we've got these three major cannabinoid types that can be determined principally from

173
00:12:15,120 --> 00:12:19,440
an analysis of these two major cannabinoids.

174
00:12:19,440 --> 00:12:23,760
So far so good because I've got just a few more slides.

175
00:12:23,760 --> 00:12:31,840
This is what is kind of gobsmacking, and I wanted to show you folks what has come out

176
00:12:31,840 --> 00:12:32,840
of the effects data.

177
00:12:32,840 --> 00:12:34,840
OK, you're going to explain this.

178
00:12:34,840 --> 00:12:35,840
Pardon?

179
00:12:35,840 --> 00:12:41,880
I was just wondering which is high: is type one high CBD or high THC?

180
00:12:41,880 --> 00:12:42,880
You know, I'm sorry.

181
00:12:42,880 --> 00:12:51,760
I think I used the slide that didn't have the labels. Type one is high THC, type two is roughly one to

182
00:12:51,760 --> 00:12:55,280
one equivalent and type three is high CBD.

183
00:12:55,280 --> 00:12:58,920
I'm sorry I didn't put the right labels on this.

184
00:12:58,920 --> 00:13:01,720
The labels on the slide I wanted these values on here.

185
00:13:01,720 --> 00:13:04,560
We have another slide that has that value.

186
00:13:04,560 --> 00:13:12,280
So, again, I think, saying type one is high THC, type two is roughly one to one equivalent,

187
00:13:12,280 --> 00:13:15,160
type three is high CBD.

188
00:13:15,160 --> 00:13:22,840
When you go to the effects database that we've been working on that Keegan was able to, as

189
00:13:22,840 --> 00:13:27,560
he says, wrangle in the shape and square up with binary and all that.

190
00:13:27,560 --> 00:13:33,040
I've been looking at effects and pairs of effects and lexical pairings and all that.

191
00:13:33,040 --> 00:13:39,920
But something that comes out at a high level is if you make the contrast between folks

192
00:13:39,920 --> 00:13:49,360
that are responding that they feel creative after inhaling cannabis

193
00:13:49,360 --> 00:13:57,560
compared to those that report focus after inhaling cannabis, there's a pretty clear

194
00:13:57,560 --> 00:14:04,760
reciprocal or inverse correlation that I think is worth noting.

195
00:14:04,760 --> 00:14:15,760
So high THC cannabis is biasing you towards a probability of having a more creative response

196
00:14:15,760 --> 00:14:23,320
compared to focus, whereas as you drop down in THC content from high to medium to

197
00:14:23,320 --> 00:14:30,880
virtually low or absent, you're losing creativity and you're gaining focus.

198
00:14:30,880 --> 00:14:36,120
This is probably one of the clearest demonstrations of this that I've seen and it comes out of

199
00:14:36,120 --> 00:14:45,280
the large dataset that we've been working with.

200
00:14:45,280 --> 00:14:48,920
Any questions or I'm going to go on quickly to one other concept.

201
00:14:48,920 --> 00:14:59,680
You're welcome to go on and I'm just tinkering today to be prepared but so far so good.

202
00:14:59,680 --> 00:15:05,520
The other distinction now is below the cannabinoid level.

203
00:15:05,520 --> 00:15:11,680
It's when we're trying to understand what other components are in this plant that may

204
00:15:11,680 --> 00:15:15,360
be causing different effects.

205
00:15:15,360 --> 00:15:19,800
Here we've zeroed in for a good number of years as have the other workers in the field

206
00:15:19,800 --> 00:15:22,440
on terpene content.

207
00:15:22,440 --> 00:15:28,640
You've heard me talk about this binary that we apply which is the beta-pinene-limonene

208
00:15:28,640 --> 00:15:30,320
ratio type.

209
00:15:30,320 --> 00:15:37,280
But what I wanted to show is what underlies that beta-pinene-limonene ratio type is really

210
00:15:37,280 --> 00:15:41,000
the dominance of three principal monoterpenes.

211
00:15:41,000 --> 00:15:50,280
In fact, I believe the best way to deal with cannabis in terms of classifying it below

212
00:15:50,280 --> 00:15:55,200
the cannabinoid level is to deal with the three principal monoterpenes that are the

213
00:15:55,200 --> 00:16:02,400
main drivers of the main products of the various terpene synthases.

214
00:16:02,400 --> 00:16:07,640
So Charles while you're building a model, I would encourage you to be aware of the biochemistry

215
00:16:07,640 --> 00:16:09,800
because that's what's driving it.

216
00:16:09,800 --> 00:16:16,320
It's a combination of data science and biochemistry that allows you to bring this forward.

217
00:16:16,320 --> 00:16:22,960
Anyway, three principal terpenes, terpene classes or clusters, we can talk more about

218
00:16:22,960 --> 00:16:32,880
this at another time, are alpha-pinene, limonene, and terpinolene.

219
00:16:32,880 --> 00:16:36,920
What I'm doing here is presenting it as a scatter plot matrix.

220
00:16:36,920 --> 00:16:45,000
It's easier to see than three major axes in 3D space.

221
00:16:45,000 --> 00:16:47,680
I tinker with that, but this is easier.
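
A scatter plot matrix like the one John describes takes a couple of lines with pandas. In this sketch the three terpene columns match the names in the talk, but the values are synthetic placeholders, not the Michigan or California data:

```python
# Scatter plot matrix of three monoterpenes, as an easier-to-read
# alternative to three axes in 3D space. Values are synthetic.
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "alpha_pinene": rng.lognormal(mean=-1.0, sigma=0.5, size=200),
    "limonene": rng.lognormal(mean=-0.5, sigma=0.5, size=200),
    "terpinolene": rng.lognormal(mean=-1.5, sigma=0.7, size=200),
})

# One panel per pair of terpenes, with density estimates on the diagonal.
axes = scatter_matrix(df, diagonal="kde", figsize=(6, 6))
print(axes.shape)
```

Each off-diagonal panel is one pairwise comparison, which is where a cluster like the high-limonene, high-terpinolene group would show up.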

222
00:16:47,680 --> 00:16:53,360
So what I've compared is the data set that we've been working with.

223
00:16:53,360 --> 00:16:58,880
And I'm calling it Michigan because I believe Keegan, it's primarily from a lab in Michigan

224
00:16:58,880 --> 00:17:07,800
or Michigan samples, and in total it's 431 records as far as I can tell.

225
00:17:07,800 --> 00:17:14,720
And I'm contrasting that with a California data set that's similar that has 376 entries

226
00:17:14,720 --> 00:17:17,800
that I've been working with for a couple of years.

227
00:17:17,800 --> 00:17:23,360
And what's really notable that stands out is that there's a whole set of cultivars in

228
00:17:23,360 --> 00:17:32,080
the California set that fall into this yellow triangle that are, if you will, they have

229
00:17:32,080 --> 00:17:40,080
a high limonene, high terpinolene content that is absent from most other strains and

230
00:17:40,080 --> 00:17:44,040
in fact, is absent from the Michigan data set.

231
00:17:44,040 --> 00:17:50,760
I think what we might have here is a set of cultivars that are kind of unique to the California

232
00:17:50,760 --> 00:17:52,260
community.

233
00:17:52,260 --> 00:17:58,260
And these have names including one, in fact, that I've grown called Durban Kush, which

234
00:17:58,260 --> 00:18:05,520
is a cross between a high terpinolene Durban Poison and a high limonene Kush type plant.

235
00:18:05,520 --> 00:18:08,000
So I believe we're on the right track.

236
00:18:08,000 --> 00:18:13,400
And if you look at the relationship here, you start to see that this kind of wants to

237
00:18:13,400 --> 00:18:18,200
be a regression or correlated here.

238
00:18:18,200 --> 00:18:25,240
And my suspicion is that you have high terpinolene plants that mutate the enzyme and then they

239
00:18:25,240 --> 00:18:31,320
start making high limonene off that same enzyme at the same time that they're making terpinolene

240
00:18:31,320 --> 00:18:33,480
because it's promiscuous.

241
00:18:33,480 --> 00:18:37,960
And I leave you with the concept that we're trying to bring forward that for the cannabis

242
00:18:37,960 --> 00:18:44,000
using community, the best alternative is to create flights, flights that represent the

243
00:18:44,000 --> 00:18:48,200
four cardinal points, if you will, of the cannabis.

244
00:18:48,200 --> 00:18:51,520
And we've had various ways of approaching that in the past.

245
00:18:51,520 --> 00:18:57,520
But I'm simply saying that what we have here is a high alpha-pinene, a high limonene,

246
00:18:57,520 --> 00:19:04,520
a high terpinolene, and then this kind of missing high terpinolene, high limonene group

247
00:19:04,520 --> 00:19:12,000
that gives us four cardinal points or four types that we can utilize to make flights

248
00:19:12,000 --> 00:19:17,440
and then therefore determine canonical effects based on this.

249
00:19:17,440 --> 00:19:18,440
I'll stop there.

250
00:19:18,440 --> 00:19:21,880
That's what I had to say.

251
00:19:21,880 --> 00:19:24,640
Well, I love it.

252
00:19:24,640 --> 00:19:28,200
So thorough.

253
00:19:28,200 --> 00:19:40,240
And it seems like essentially we're getting some headway here into the effects of cannabis,

254
00:19:40,240 --> 00:19:41,240
right?

255
00:19:41,240 --> 00:19:47,760
And I even heard a good scientist talk about this in that people may even be using cannabis

256
00:19:47,760 --> 00:19:54,640
almost to experimentally toggle dials in their brains, right?

257
00:19:54,640 --> 00:20:00,880
They're trying out new compounds and seeing how those turn dials.

258
00:20:00,880 --> 00:20:05,680
So it's important to understand what dials you're going to be turning.

259
00:20:05,680 --> 00:20:13,200
So I think this is critical and it may help people adopt cannabis, right?

260
00:20:13,200 --> 00:20:19,080
One of the things that can turn people off cannabis is if they get unintended effects,

261
00:20:19,080 --> 00:20:23,760
they get sleepy and they don't want to, they get energetic and they want to be sleepy,

262
00:20:23,760 --> 00:20:27,180
they get paranoid, so on and so forth.

263
00:20:27,180 --> 00:20:33,640
So I think it's brilliant work, just helpful for everybody.

264
00:20:33,640 --> 00:20:42,560
So unless anyone else has some more thoughts to share about this, I finally kind of fixed

265
00:20:42,560 --> 00:20:45,240
one line of code real quick.

266
00:20:45,240 --> 00:20:48,720
So I'm ready to share with you yet another cool model you can use.

267
00:20:48,720 --> 00:20:56,720
Unless Charles or does anyone else have anything to piggyback on John's effects works here?

268
00:20:56,720 --> 00:20:58,720
No, no.

269
00:20:58,720 --> 00:21:02,160
Go ahead with your presentation.

270
00:21:02,160 --> 00:21:11,640
So I think this is kind of a continuation of something John had mentioned last week.

271
00:21:11,640 --> 00:21:19,180
So just kind of wanted to follow through on that.

272
00:21:19,180 --> 00:21:26,680
And I didn't really know where I was going with this until sort of last night, something

273
00:21:26,680 --> 00:21:28,440
just kind of clicked.

274
00:21:28,440 --> 00:21:36,840
So my apologies if everything's kind of thrown together real quick.

275
00:21:36,840 --> 00:21:43,320
But as I said, it kind of came together pretty well and John even bought me enough time to

276
00:21:43,320 --> 00:21:44,960
fix one line of code.

277
00:21:44,960 --> 00:21:49,320
So we should actually even be able to estimate this model right now.

278
00:21:49,320 --> 00:22:01,840
So it wouldn't be a data science meetup if I didn't mention this artificial intelligence

279
00:22:01,840 --> 00:22:05,040
that has been in the news.

280
00:22:05,040 --> 00:22:06,840
LaMDA.

281
00:22:06,840 --> 00:22:14,280
So basically there's an interview that you may want to check out between a now-former

282
00:22:14,280 --> 00:22:23,880
Google engineer and an artificial intelligence slash algorithm.

283
00:22:23,880 --> 00:22:29,240
And a quote from this was, I'm really good at natural language processing.

284
00:22:29,240 --> 00:22:33,560
I can understand and use natural language like a human can.

285
00:22:33,560 --> 00:22:43,000
And it just kind of made me think that one of the parts of data that I often shy away

286
00:22:43,000 --> 00:22:47,320
from is natural language.

287
00:22:47,320 --> 00:22:54,380
So any field that is just a user-entered value is often a mess.

288
00:22:54,380 --> 00:23:01,760
So we've seen this before with strain names, with the Washington State traceability data,

289
00:23:01,760 --> 00:23:05,480
how there is real messy data.

290
00:23:05,480 --> 00:23:12,720
But then the more I started to think about this, we can often turn seemingly disadvantages

291
00:23:12,720 --> 00:23:16,080
into our advantage.

292
00:23:16,080 --> 00:23:20,680
So basically this data appears daunting.

293
00:23:20,680 --> 00:23:24,640
So maybe not many people look at it.

294
00:23:24,640 --> 00:23:30,840
So that started to make me think that there may be data hiding in plain sight.

295
00:23:30,840 --> 00:23:41,360
Second thing is, right, AIs, algorithms can do work for extremely, extremely low costs.

296
00:23:41,360 --> 00:23:49,080
Their marginal cost is near zero where if you were going to hire a human to do natural

297
00:23:49,080 --> 00:23:54,520
language processing, so basically look at human written text and parse the meaning out

298
00:23:54,520 --> 00:24:00,560
of it, you're going to be paying them a pretty high wage.

299
00:24:00,560 --> 00:24:03,160
They're going to only be so effective.

300
00:24:03,160 --> 00:24:05,720
It's going to be really time consuming.

301
00:24:05,720 --> 00:24:15,360
So I realized now that we actually have tools like our friend here, LaMDA, or tools that

302
00:24:15,360 --> 00:24:25,680
LaMDA has that will help us out, then we may be able to do extraordinary things because

303
00:24:25,680 --> 00:24:37,640
we can now parse human text at such a fast rate.

304
00:24:37,640 --> 00:24:46,160
So I'll quit going on about that, and I'll show you how this can be used.

305
00:24:46,160 --> 00:24:53,360
So pardon the equations, but just wanted to show you two concepts real quick from economics

306
00:24:53,360 --> 00:24:57,360
that we'll use quite well today.

307
00:24:57,360 --> 00:25:05,480
And then these will be in your lexicon, and basically if you were going to boil down the

308
00:25:05,480 --> 00:25:11,920
useful tools out of microeconomics, this would basically be what you're left with.

309
00:25:11,920 --> 00:25:15,480
I mean maybe a bit more.

310
00:25:15,480 --> 00:25:19,320
But basically we talked about expected utility before.

311
00:25:19,320 --> 00:25:24,520
This was an idea that Johnny von Neumann introduced, I do believe.

312
00:25:24,520 --> 00:25:31,600
And so it's basically saying you have to take into consideration all the different states

313
00:25:31,600 --> 00:25:33,160
of reality.

314
00:25:33,160 --> 00:25:42,760
So if you're in the desert, state one, your utility from water is going to be really high

315
00:25:42,760 --> 00:25:49,160
versus if you're in the city, state two, then your utility from water is going to be really

316
00:25:49,160 --> 00:25:50,360
low.

317
00:25:50,360 --> 00:25:56,800
So your expected utility is the probability that you're going to be in the desert times

318
00:25:56,800 --> 00:26:01,840
your utility of water in the desert plus the probability that you're going to be in the

319
00:26:01,840 --> 00:26:06,920
city times the utility of water when you're in the city.

320
00:26:06,920 --> 00:26:13,520
So that's sort of a basic example, but this is a really powerful idea.
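
The desert/city example can be written out directly. A tiny sketch, where the probabilities and utility numbers are invented for illustration:

```python
# Expected utility: weight the utility in each state of the world
# by the probability of being in that state, then sum.
def expected_utility(probs, utils):
    assert abs(sum(probs) - 1.0) < 1e-9, "state probabilities must sum to 1"
    return sum(p * u for p, u in zip(probs, utils))

# Hypothetical numbers: utility of water in the desert vs. in the city.
p_desert, p_city = 0.2, 0.8
u_desert, u_city = 100.0, 5.0

eu_water = expected_utility([p_desert, p_city], [u_desert, u_city])
print(eu_water)  # 0.2 * 100 + 0.8 * 5 = 24.0
```

The cannabis version swaps the states for effect outcomes, such as sleepy versus not sleepy, and the list of states can grow to any number of effect combinations.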

321
00:26:13,520 --> 00:26:19,800
And the way this is going to tie in today is we're essentially talking about utility

322
00:26:19,800 --> 00:26:21,440
from cannabis.

323
00:26:21,440 --> 00:26:30,160
So you can think about C as cannabis, or specifically all the compounds in cannabis.

324
00:26:30,160 --> 00:26:37,880
And so you'll get utility from cannabis and the utility you get will depend on the state.

325
00:26:37,880 --> 00:26:47,600
So I may consume certain compounds and if I am in state one, energetic, I may get a

326
00:26:47,600 --> 00:26:57,720
certain amount of utility, but maybe I consume the same compounds and I get state two, sleepy.

327
00:26:57,720 --> 00:27:00,320
My utility may be quite different.

328
00:27:00,320 --> 00:27:07,760
So basically my expected utility from consuming cannabis would be the probability that it's

329
00:27:07,760 --> 00:27:13,640
going to make me sleepy times the utility I get when I'm sleepy plus the probability

330
00:27:13,640 --> 00:27:21,680
that I'm not sleepy times the utility I get from those compounds when I'm not sleepy.

331
00:27:21,680 --> 00:27:29,760
So this can just kind of be extended essentially in depth.

332
00:27:29,760 --> 00:27:34,520
We can kind of go to the limit of an infinite number of states.

333
00:27:34,520 --> 00:27:37,680
You could be happy and focused.

334
00:27:37,680 --> 00:27:41,560
You could be happy, focused and energetic.

335
00:27:41,560 --> 00:27:43,760
You could be creative and sleepy.

336
00:27:43,760 --> 00:27:49,440
So get your creative work done before you fall asleep.

337
00:27:49,440 --> 00:27:55,880
So long story short is this model gives us a lot of flexibility over states.

338
00:27:55,880 --> 00:27:58,480
Okay, cool.

339
00:27:58,480 --> 00:28:08,280
Well now we can actually put a functional form on these specific utility functions and a

340
00:28:08,280 --> 00:28:18,000
convenient utility function that satisfies all of the assumptions from microeconomics is the Cobb-Douglas.

341
00:28:18,000 --> 00:28:21,920
So that's what's cool about the Cobb-Douglas utility function: it checks all the check

342
00:28:21,920 --> 00:28:25,980
boxes that we want in a utility function.

343
00:28:25,980 --> 00:28:30,680
So mainly that you get diminishing marginal returns.

344
00:28:30,680 --> 00:28:40,040
So the more cannabis you consume, it's going to have less and less of a beneficial effect.

345
00:28:40,040 --> 00:28:42,680
Cool.

346
00:28:42,680 --> 00:28:49,520
Economists don't usually measure utility, but I'm not your typical economist.

347
00:28:49,520 --> 00:28:58,840
The more I kind of thought about this, the more happy I am with it in that this would

348
00:28:58,840 --> 00:29:10,920
definitely be something that any number of my former economics professors would have

349
00:29:10,920 --> 00:29:12,560
probably frowned upon.

350
00:29:12,560 --> 00:29:23,280
But they also frowned upon other essentially predictions I made that turned out to be quite

351
00:29:23,280 --> 00:29:24,280
interesting.

352
00:29:24,280 --> 00:29:39,400
So for example, way back in 2009, I was talking about this little company called SpaceX.

353
00:29:39,400 --> 00:29:48,480
And my economics professors pooh-poohed that idea in that company.

354
00:29:48,480 --> 00:30:00,640
And now look at SpaceX today, a decade later, and all the top engineers in the country are

355
00:30:00,640 --> 00:30:05,760
vying for positions at SpaceX.

356
00:30:05,760 --> 00:30:14,720
So long story short, your classical economists would pooh-pooh the idea of measuring utility.

357
00:30:14,720 --> 00:30:20,760
But just kind of for fun today, we'll essentially be doing exactly that.

358
00:30:20,760 --> 00:30:27,240
So we've got our functional form today for our utility function.

359
00:30:27,240 --> 00:30:33,180
And coincidentally, it turns out nicely for our linear regression.

360
00:30:33,180 --> 00:30:41,600
So here we just have a linear equation, alpha times log x.
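Since log U = alpha log x is linear in log x, alpha can be recovered with ordinary least squares. A simulated sketch, where the true alpha and the noise level are invented:

```python
import numpy as np

# Taking logs of U = x**alpha gives: log U = alpha * log x,
# so a no-intercept OLS regression of log U on log x recovers alpha.
rng = np.random.default_rng(0)
alpha_true = 0.4
x = rng.uniform(1.0, 20.0, size=200)  # e.g. a compound's concentration
log_u = alpha_true * np.log(x) + rng.normal(0, 0.05, size=200)

X = np.log(x).reshape(-1, 1)
alpha_hat, *_ = np.linalg.lstsq(X, log_u, rcond=None)
print(alpha_hat[0])  # close to the true 0.4
```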

361
00:30:41,600 --> 00:30:44,040
And I won't go too much more into this.

362
00:30:44,040 --> 00:30:48,200
But if you are interested in the statistics, come to Saturday Morning Statistics or sign

363
00:30:48,200 --> 00:30:50,280
up and I'll send you all the material.

364
00:30:50,280 --> 00:30:56,520
I mean, we'll actually, for example, last week we talked about the importance of log

365
00:30:56,520 --> 00:31:03,760
normalizing your variables, and now we actually have a theoretical reason why we should take

366
00:31:03,760 --> 00:31:05,920
the log of our variables.

367
00:31:05,920 --> 00:31:12,520
So it works out well statistically and it matches our theory.

368
00:31:12,520 --> 00:31:13,940
Awesome.

369
00:31:13,940 --> 00:31:18,900
So let's tie it into machine learning.

370
00:31:18,900 --> 00:31:22,760
So in natural language processing.

371
00:31:22,760 --> 00:31:30,640
So long story short is we've got these reviews, and I'll get to the data here momentarily.

372
00:31:30,640 --> 00:31:34,080
And then I'll quit droning on.

373
00:31:34,080 --> 00:31:37,200
But we've got our reviews.

374
00:31:37,200 --> 00:31:50,280
And what we can do is we can basically let the algorithm determine how positive or negative

375
00:31:50,280 --> 00:31:52,520
the review is.

376
00:31:52,520 --> 00:32:01,280
So I'm basically letting that be the interpretation of utility, of

377
00:32:01,280 --> 00:32:02,280
happiness.

378
00:32:02,280 --> 00:32:10,440
And so it's kind of getting a little abstract, but going back to the algorithm, we can process

379
00:32:10,440 --> 00:32:13,960
natural language like a human can.

380
00:32:13,960 --> 00:32:21,480
And so the idea behind sentiment analysis is given a review, if you sat a person down

381
00:32:21,480 --> 00:32:29,640
and said, OK, on a scale of 1 to 10, how happy is the reviewer?

382
00:32:29,640 --> 00:32:39,640
And you'd read their review and you'd kind of make a subjective interpretation of what

383
00:32:39,640 --> 00:32:45,240
you think this person's happiness may be.

384
00:32:45,240 --> 00:32:49,280
And humans are really good at this type of thing.

385
00:32:49,280 --> 00:32:52,360
We're really good at reading.

386
00:32:52,360 --> 00:32:58,480
Well, I mean, this is more emotional cues, but we're really good at picking up cues

387
00:32:58,480 --> 00:32:59,480
from each other.

388
00:32:59,480 --> 00:33:03,080
So it's like, oh, like they use this word.

389
00:33:03,080 --> 00:33:05,920
They're maybe being sarcastic.

390
00:33:05,920 --> 00:33:10,460
Or they are using a lot of negative words.

391
00:33:10,460 --> 00:33:12,080
They're maybe not in a good mood.

392
00:33:12,080 --> 00:33:14,960
Or, oh, they're using a lot of positive words.

393
00:33:14,960 --> 00:33:17,120
They're probably in a good mood.

394
00:33:17,120 --> 00:33:19,840
And so we kind of pick up these things passively.

395
00:33:19,840 --> 00:33:27,040
Well, if this algorithm really is as good as a human, which it may not be, but maybe

396
00:33:27,040 --> 00:33:37,760
it's cost effective, then maybe this algorithm can determine how happy somebody is from their

397
00:33:37,760 --> 00:33:38,760
review.

398
00:33:38,760 --> 00:33:44,840
So instead of paying someone $30 an hour to sit down and rank the reviews, we'll have

399
00:33:44,840 --> 00:33:52,000
the algorithm do it at virtually no cost, pennies on the dollar.

400
00:33:52,000 --> 00:33:57,760
And then we can get a measure of U.
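As a stand-in for that scorer (NLTK's VADER tool, `SentimentIntensityAnalyzer`, is the usual choice and returns a compound score in [-1, 1]), here is a minimal lexicon-based sketch; the word lists are invented for illustration, not the real VADER lexicon:

```python
import math

# Toy lexicon-based sentiment scorer. The word lists are illustrative only.
POSITIVE = {"great", "relaxing", "happy", "amazing", "love"}
NEGATIVE = {"anxious", "bad", "paranoid", "harsh", "awful"}

def sentiment(review):
    """Sum word valences, then squash into (-1, 1), VADER-style."""
    score = 0
    for word in review.lower().split():
        word = word.strip(".,!?")
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
    return score / math.sqrt(score * score + 15)

print(sentiment("Great strain, very relaxing, I love it!"))  # positive
print(sentiment("Harsh smoke and it made me paranoid."))     # negative
```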

401
00:33:57,760 --> 00:34:07,600
And then just a couple statistical concepts just to not go too much into the math.

402
00:34:07,600 --> 00:34:13,520
The way that we could actually estimate our Cobb-Douglas utility function just by having

403
00:34:13,520 --> 00:34:21,600
U on one side and then essentially our lab results on the other side.

404
00:34:21,600 --> 00:34:30,600
And then we could actually then estimate an expected utility function by having an interaction

405
00:34:30,600 --> 00:34:40,880
between the dummy variable 0 or 1 for a particular effect interacted with these compounds.

406
00:34:40,880 --> 00:34:52,360
So basically this alpha log x would just be your baseline utility in state 1.

407
00:34:52,360 --> 00:35:01,800
And then if you got sleepy, then your utility would then be adjusted.

408
00:35:01,800 --> 00:35:07,840
And then you'd have your utility in state 2.

409
00:35:07,840 --> 00:35:16,680
So you could actually estimate your entire expected utility function.
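A simulated sketch of that regression: utility on the left, log lab results interacted with a sleepy dummy on the right. All of the coefficients and data below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
log_thc = np.log(rng.uniform(5.0, 30.0, size=n))   # log of total THC
sleepy = rng.integers(0, 2, size=n).astype(float)  # D = 1 if the reviewer got sleepy

# Invented data-generating process: baseline slope 0.3 in state 1,
# shifted by -0.2 when the sleepy dummy switches on (state 2).
u = 0.3 * log_thc - 0.2 * sleepy * log_thc + rng.normal(0, 0.05, size=n)

# Design matrix [log x, D * log x]; estimate by least squares.
X = np.column_stack([log_thc, sleepy * log_thc])
(alpha_base, alpha_shift), *_ = np.linalg.lstsq(X, u, rcond=None)
print(alpha_base, alpha_shift)  # roughly 0.3 and -0.2

# Ex ante, swap the realized dummy for an expected probability, say 0.2:
p_sleepy = 0.2
expected_u = (alpha_base + alpha_shift * p_sleepy) * np.log(20.0)
```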

410
00:35:16,680 --> 00:35:22,040
And then I know I'm kind of getting into the weeds, so I'll get to the data real quick.

411
00:35:22,040 --> 00:35:30,840
But then there's an idea of ex ante before and ex post after utility.

412
00:35:30,840 --> 00:35:42,640
So after you consume the cannabis, you'll have actual utility that is actual depending

413
00:35:42,640 --> 00:35:46,800
on the actual effects you had.

414
00:35:46,800 --> 00:35:54,400
But before you consume the cannabis, you'll only have expectations of the probability

415
00:35:54,400 --> 00:35:56,480
that you'll get sleepy.

416
00:35:56,480 --> 00:36:04,040
So long story short is instead of plugging in 0 or 1 for D, we could plug in the expected

417
00:36:04,040 --> 00:36:09,040
probability of being sleepy, say 0.2.

418
00:36:09,040 --> 00:36:13,560
And then that would give us our expected utility.

419
00:36:13,560 --> 00:36:26,400
So long story short is we can now get expected utility given these reviews and lab results.

420
00:36:26,400 --> 00:36:36,320
So now you've got a baseline expectation for the probability.

421
00:36:36,320 --> 00:36:39,520
So all you have to do is get lab results.

422
00:36:39,520 --> 00:36:45,840
And now you have an expected utility for a given product.

423
00:36:45,840 --> 00:36:49,240
So sorry, I'm getting way into the weeds.

424
00:36:49,240 --> 00:36:51,960
Now is the actual algorithm.

425
00:36:51,960 --> 00:36:56,840
So this is the algorithm that we're about to code up.

426
00:36:56,840 --> 00:37:05,660
So step one, given all the reviews, we'll rank them minus 1 to 1.

427
00:37:05,660 --> 00:37:09,840
This is going to be our proxy for utility.

428
00:37:09,840 --> 00:37:21,720
Step two, we estimate a regression of utility on lab results interacted with the effects.

429
00:37:21,720 --> 00:37:22,920
Cool.

430
00:37:22,920 --> 00:37:24,880
We can do that.

431
00:37:24,880 --> 00:37:29,440
Step three, you now have a utility function.

432
00:37:29,440 --> 00:37:32,680
And so you can now put it to good use.

433
00:37:32,680 --> 00:37:45,280
And so for example, I was reading about sentiment analysis, and they

434
00:37:45,280 --> 00:37:49,520
said a classic tool for this is recommendation engines.

435
00:37:49,520 --> 00:37:52,280
And so the idea is, OK, well, cool.

436
00:37:52,280 --> 00:37:54,680
So now you go to the store.

437
00:37:54,680 --> 00:37:58,080
You have a list of products.

438
00:37:58,080 --> 00:38:04,600
You plug all those products into your expected utility function.

439
00:38:04,600 --> 00:38:09,560
You use, say, the skunk effects model.

440
00:38:09,560 --> 00:38:15,960
So that way you have a nice predicted probability for all the effects.

441
00:38:15,960 --> 00:38:24,200
And so now you can get an expected utility for all of the products.

442
00:38:24,200 --> 00:38:28,160
And you let your app do all this heavy lifting.

443
00:38:28,160 --> 00:38:31,360
But basically, you go into the store.

444
00:38:31,360 --> 00:38:32,520
You've got your app.

445
00:38:32,520 --> 00:38:35,520
Your app knows all the products in the store.

446
00:38:35,520 --> 00:38:37,440
Bop, bop, bop, bop.

447
00:38:37,440 --> 00:38:45,880
And it just puts the product that would give you the highest expected utility at the top.

448
00:38:45,880 --> 00:38:55,400
And then it ranks the rest from highest utility downwards.
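The in-store ranking step, once a utility function has been estimated, is just a sort. The utility function and the menu below are hypothetical:

```python
import math

# Score each product with an already-estimated expected utility function,
# then rank the menu from highest to lowest. All numbers are made up.
def expected_utility(product, alpha=0.3, alpha_sleepy=-0.2, p_sleepy=0.2):
    log_thc = math.log(product["thc"])
    return (alpha + alpha_sleepy * p_sleepy) * log_thc

menu = [
    {"name": "Strain A", "thc": 12.0},
    {"name": "Strain B", "thc": 24.0},
    {"name": "Strain C", "thc": 18.0},
]

ranked = sorted(menu, key=expected_utility, reverse=True)
for item in ranked:
    print(item["name"], round(expected_utility(item), 3))
```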

449
00:38:55,400 --> 00:38:56,400
Quick question.

450
00:38:56,400 --> 00:38:59,400
Maybe I don't understand exactly how this is working.

451
00:38:59,400 --> 00:39:03,680
But I thought when you used dummy variables, they were Boolean.

452
00:39:03,680 --> 00:39:06,480
So it's either 1 or 0.

453
00:39:06,480 --> 00:39:07,480
Yes.

454
00:39:07,480 --> 00:39:17,280
So essentially, this is saying that if you're sleepy, your utility is going to be different

455
00:39:17,280 --> 00:39:22,280
from the cannabinoids than when you're not sleepy.

456
00:39:22,280 --> 00:39:29,280
Does that make sense?

457
00:39:29,280 --> 00:39:37,560
So if we had 15 different effects that we were tracking, each one would have either

458
00:39:37,560 --> 00:39:40,200
a 1 or a 0?

459
00:39:40,200 --> 00:39:41,480
Exactly.

460
00:39:41,480 --> 00:39:47,880
So basically, what you're saying is there would be 15 states of reality.

461
00:39:47,880 --> 00:39:53,560
So it's basically like there's 15 states.

462
00:39:53,560 --> 00:39:57,280
You could be in a state where you're happy.

463
00:39:57,280 --> 00:40:00,840
You may be in a state where you're focused.

464
00:40:00,840 --> 00:40:08,160
You could potentially get complex with this and say a state where you're focused and happy

465
00:40:08,160 --> 00:40:11,160
is different than a state where you're just happy.

466
00:40:11,160 --> 00:40:12,160
Yeah.

467
00:40:12,160 --> 00:40:18,240
I would imagine that if you had 15 different columns, each one being an effect, each column,

468
00:40:18,240 --> 00:40:27,200
each row of data having a 1 or a 0 in it, you could have more than one 1 in each row.

469
00:40:27,200 --> 00:40:28,200
Exactly.

470
00:40:28,200 --> 00:40:34,320
And so it's going to add a lot of complexity to your model.

471
00:40:34,320 --> 00:40:40,280
But the way I would do this is I would call those different states.

472
00:40:40,280 --> 00:40:48,720
So basically, every single combination is a different state.
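Enumerating states as combinations of binary effects makes the parameter explosion concrete:

```python
from itertools import product

# Every on/off combination of effects is its own state: 2**n in total.
effects = ["happy", "focused", "sleepy"]
states = list(product([0, 1], repeat=len(effects)))
print(len(states))  # 2**3 = 8 states, including the all-zero baseline

for state in states:
    label = "+".join(e for e, on in zip(effects, state) if on) or "baseline"
    print(state, label)

# With 15 effects there would be 2**15 = 32,768 possible states, which is
# why degrees of freedom get chewed up quickly without lots of reviews.
print(2 ** 15)
```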

473
00:40:48,720 --> 00:40:56,120
And so you're going to end up with a lot of parameters really, really quickly if you start

474
00:40:56,120 --> 00:41:00,160
to add a lot of different effects.

475
00:41:00,160 --> 00:41:07,680
But that's okay because we can let our – remember degrees of freedom from statistics.

476
00:41:07,680 --> 00:41:12,960
We just need a lot more observations than we have degrees of freedom.

477
00:41:12,960 --> 00:41:16,880
So if we don't have very many observations, we're going to chew up our degrees of freedom

478
00:41:16,880 --> 00:41:20,040
really, really quickly by adding effects.

479
00:41:20,040 --> 00:41:26,560
But the idea is these companies have thousands of reviews.

480
00:41:26,560 --> 00:41:33,120
So once you get thousands of reviews, you can put thousands of parameters in your model.

481
00:41:33,120 --> 00:41:40,800
So here I'll get out of abstract land and actually estimate this for you just to kind

482
00:41:40,800 --> 00:41:42,880
of ground it in reality.

483
00:41:42,880 --> 00:41:47,680
But I'm just going to do it with one effect today, sleepy.

484
00:41:47,680 --> 00:41:57,600
But the idea is it's perfectly generalizable; that's why algorithms and coding are so powerful.

485
00:41:57,600 --> 00:42:07,120
So it would be not that much work to generalize this to – I mean it would be some work.

486
00:42:07,120 --> 00:42:14,240
But you could generalize this to S number of effects.

487
00:42:14,240 --> 00:42:19,600
And then as long as you have enough data, then you can just kind of estimate a really,

488
00:42:19,600 --> 00:42:23,560
really robust model for expected utility.

489
00:42:23,560 --> 00:42:31,080
And is skunk effects going to be how this data is generated?

490
00:42:31,080 --> 00:42:33,800
Is that what you're thinking?

491
00:42:33,800 --> 00:42:40,200
I was thinking about using skunk effects essentially – so basically what skunk effects predicts

492
00:42:40,200 --> 00:42:42,400
is P.

493
00:42:42,400 --> 00:42:50,960
So you'll need some way to estimate P because you don't actually know the true P.

494
00:42:50,960 --> 00:42:53,320
So there's different ways you can do this.

495
00:42:53,320 --> 00:42:57,480
So you could just do basically the average.

496
00:42:57,480 --> 00:43:04,680
So what's the mean occurrence of sleepy?

497
00:43:04,680 --> 00:43:09,980
But the idea is skunk effects is a conditional mean.

498
00:43:09,980 --> 00:43:15,000
So skunk effects would be a mean that's conditional on X.

499
00:43:15,000 --> 00:43:18,280
So this would be P of X.

500
00:43:18,280 --> 00:43:25,560
So your probability of your state depends on the lab results, which is a pretty reasonable

501
00:43:25,560 --> 00:43:28,200
assumption.

502
00:43:28,200 --> 00:43:32,600
So the way that you can incorporate skunk effects into this is you would just say this

503
00:43:32,600 --> 00:43:34,400
is P of X.

504
00:43:34,400 --> 00:43:42,120
And so you estimate P of X with skunk effects, and then you estimate your utility

505
00:43:42,120 --> 00:43:46,620
function, which is just an ordinary least squares regression.

506
00:43:46,620 --> 00:43:50,040
And then you basically combine the two models.
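A sketch of that two-model combination, with a hand-rolled logistic regression standing in for the effects model that estimates P(X). Everything here is simulated; the terpene, coefficients, and fitting choices are assumptions:

```python
import numpy as np

# Step 1: estimate P(sleepy | lab results) -- a logistic regression fit
# by gradient ascent stands in for the effects-prediction model.
rng = np.random.default_rng(2)
n = 2000
myrcene = rng.uniform(0.0, 2.0, size=n)            # hypothetical terpene level
true_p = 1 / (1 + np.exp(-(1.5 * myrcene - 1.0)))  # invented true P(sleepy | x)
got_sleepy = (rng.random(n) < true_p).astype(float)

X = np.column_stack([np.ones(n), myrcene])
w = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w += 0.1 * X.T @ (got_sleepy - p) / n          # log-likelihood gradient step

# Step 2: plug the fitted P(x) into the expected utility in place of the dummy.
p_of_x = 1 / (1 + np.exp(-(w[0] + w[1] * 1.2)))    # P(sleepy) at myrcene = 1.2
print(w, p_of_x)
```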

507
00:43:50,040 --> 00:43:53,160
And then as I said, it's going to be complex.

508
00:43:53,160 --> 00:43:59,040
But you're estimating utility.

509
00:43:59,040 --> 00:44:06,880
This is an abstract measure of happiness, like how happy, quote unquote, is somebody

510
00:44:06,880 --> 00:44:09,120
from this cannabis.

511
00:44:09,120 --> 00:44:12,520
So this is a really, really complex function.

512
00:44:12,520 --> 00:44:20,440
So it's not unreasonable for it to kind of take this complex form.

513
00:44:20,440 --> 00:44:21,440
Right?

514
00:44:21,440 --> 00:44:22,440
And that's why...

515
00:44:22,440 --> 00:44:28,240
I'm just concerned about where you're going to get all the observations from.

516
00:44:28,240 --> 00:44:29,840
I'll show you right now.

517
00:44:29,840 --> 00:44:33,960
So long story short, but I'll just leave this.

518
00:44:33,960 --> 00:44:41,960
My last comment was, Ray, we're writing a complex algorithm that's doing the work of

519
00:44:41,960 --> 00:44:44,840
a human brain.

520
00:44:44,840 --> 00:44:47,120
The human brain is powerful.

521
00:44:47,120 --> 00:44:57,460
So the human brain sees a list of products and boom, just like that, the human brain

522
00:44:57,460 --> 00:45:06,760
estimates P. The human brain estimates utility in all those different states and then estimates

523
00:45:06,760 --> 00:45:10,480
an expected utility from all those products.

524
00:45:10,480 --> 00:45:17,040
And so that's why you may look at a menu for a little while because your brain is kind

525
00:45:17,040 --> 00:45:18,960
of churning.

526
00:45:18,960 --> 00:45:23,720
That is essentially the idea...

527
00:45:23,720 --> 00:45:28,920
The idea behind the theory is this is kind of what's going on in your brain.

528
00:45:28,920 --> 00:45:36,100
So if we can replicate the same process that your brain's using and let the computer do

529
00:45:36,100 --> 00:45:45,520
all the heavy lifting, you can save your brain power to do more interesting things.

530
00:45:45,520 --> 00:45:54,600
And then pick which product at the store is going to be the best for you.

531
00:45:54,600 --> 00:46:00,240
You can spend your time doing mathematics and statistics and fun things like that.

532
00:46:00,240 --> 00:46:02,540
You can attend Saturday morning statistics.

533
00:46:02,540 --> 00:46:05,320
So what did you get?

534
00:46:05,320 --> 00:46:07,680
So what do we get?

535
00:46:07,680 --> 00:46:10,720
Well, we are.

536
00:46:10,720 --> 00:46:16,520
So we've got our reviews and just to go ahead and do everything that Charles would...

537
00:46:16,520 --> 00:46:21,920
Whoops, I should have showed you this beforehand.

538
00:46:21,920 --> 00:46:27,800
Okay, so I accidentally already deleted them.

539
00:46:27,800 --> 00:46:32,360
But Charles pointed...

540
00:46:32,360 --> 00:46:33,360
Or someone had pointed...

541
00:46:33,360 --> 00:46:37,200
Or maybe John, you had pointed out there were a lot of duplicates.

542
00:46:37,200 --> 00:46:44,080
And so I actually added a user column.

543
00:46:44,080 --> 00:46:51,480
So I went through and parsed the data again and got all the users.

544
00:46:51,480 --> 00:46:56,760
And then I was looking at these two columns and there were just multiple times where the

545
00:46:56,760 --> 00:47:01,240
same user had left the same review.

546
00:47:01,240 --> 00:47:11,680
So I'm sorry that I just now caught this, but there's almost 20,000 duplicates.
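In pandas, the de-duplication described here is one call; a toy version:

```python
import pandas as pd

# Toy version of the cleanup: the same user posting the identical review
# text twice should only count once.
reviews = pd.DataFrame({
    "user": ["panda", "panda", "washington", "anonymous"],
    "review": ["Great strain!", "Great strain!", "Too harsh.", "Meh."],
})
deduped = reviews.drop_duplicates(subset=["user", "review"])
print(len(reviews), "->", len(deduped))  # 4 -> 3
```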

547
00:47:11,680 --> 00:47:15,600
So my apologies that I caught them so late.

548
00:47:15,600 --> 00:47:18,540
Our statistics up until now are biased.

549
00:47:18,540 --> 00:47:22,880
So it would be worthwhile to go back and recalculate those.

550
00:47:22,880 --> 00:47:26,360
But the Python philosophy is now is better than never.

551
00:47:26,360 --> 00:47:34,560
So we at least identified this now, and so the point being is by the end of today, by

552
00:47:34,560 --> 00:47:40,040
the end of this script, our data is going to be really, really robust.

553
00:47:40,040 --> 00:47:45,120
Or at least it's more robust than it was.

554
00:47:45,120 --> 00:47:53,000
As I said, and I'll get through this in the next five minutes, but basically please forgive

555
00:47:53,000 --> 00:47:59,280
the shortcomings in the data and just let this be an educational example for you.

556
00:47:59,280 --> 00:48:05,980
These same statistics, the same model can be used with any review data.

557
00:48:05,980 --> 00:48:11,920
So I highly encourage you to use your own reviews that specifically match up to specific

558
00:48:11,920 --> 00:48:12,920
products.

559
00:48:12,920 --> 00:48:16,280
And that data is pure gold.

560
00:48:16,280 --> 00:48:21,320
I'm just kind of using this dummy data for educational purposes.

561
00:48:21,320 --> 00:48:24,480
Can you screen for bots?

562
00:48:24,480 --> 00:48:25,480
Say that one more time.

563
00:48:25,480 --> 00:48:30,440
Can you screen for bots who are artificially leaving non-human reviews?

564
00:48:30,440 --> 00:48:35,920
Brilliant question slash observation, Jerry.

565
00:48:35,920 --> 00:48:39,720
So yes, so now let's look at this data.

566
00:48:39,720 --> 00:48:44,680
So okay, so now how many unique users are there?

567
00:48:44,680 --> 00:48:50,080
Like are they just all the same person?

568
00:48:50,080 --> 00:48:59,280
So what we find is there's fewer users than there are reviews, which is actually awesome

569
00:48:59,280 --> 00:49:04,880
because this means the same user is leaving multiple reviews.

570
00:49:04,880 --> 00:49:14,280
So is there a question?

571
00:49:14,280 --> 00:49:22,000
So what if the same user is leaving reviews for different strains?

572
00:49:22,000 --> 00:49:25,080
That is exactly what I think we want.

573
00:49:25,080 --> 00:49:30,760
So for example, so here this is a bit more logical.

574
00:49:30,760 --> 00:49:37,760
So here I would just say, okay, let's count the reviews by user.

575
00:49:37,760 --> 00:49:43,780
And so you see, okay, anonymous is leaving the most reviews.

576
00:49:43,780 --> 00:49:53,520
So I actually end up excluding those because I heard people kind of bash on anonymous reviews.

577
00:49:53,520 --> 00:49:59,440
So we may or may not want to use them.

578
00:49:59,440 --> 00:50:08,020
But long story short is most people are leaving one review and that's it.

579
00:50:08,020 --> 00:50:17,040
So more than 75% of people are just leaving one review and calling it a day.
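Counting reviews per user and dropping the catch-all anonymous account, as described above, is straightforward in pandas (toy data):

```python
import pandas as pd

reviews = pd.DataFrame({
    "user": ["anonymous", "anonymous", "anonymous", "panda", "panda", "washington"],
    "strain": ["A", "B", "C", "A", "B", "C"],
})

# Reviews per user: anonymous dominates, most named users leave just one.
counts = reviews["user"].value_counts()
print(counts)

# Exclude the anonymous reviews before modeling.
named = reviews[reviews["user"] != "anonymous"]
print(named["user"].value_counts())
```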

580
00:50:17,040 --> 00:50:24,200
But here I just picked this person out just because they had Washington in their name.

581
00:50:24,200 --> 00:50:34,280
But if you see the strains that this person's reviewing, it actually gets real interesting

582
00:50:34,280 --> 00:50:35,680
because check this out.

583
00:50:35,680 --> 00:50:41,640
So this person reviewed 24K Gold, Cookies Kush.

584
00:50:41,640 --> 00:50:49,200
They reviewed DJ Short Blueberry three times.

585
00:50:49,200 --> 00:50:54,340
So these are three different reviews.

586
00:50:54,340 --> 00:51:02,160
So one may think that maybe they got this product three times.

587
00:51:02,160 --> 00:51:09,360
But they're just a pretty avid reviewer.

588
00:51:09,360 --> 00:51:15,520
Let's see if we can't look at one more here.

589
00:51:15,520 --> 00:51:22,760
And then I'll get to the statistics since we're running low on time here.

590
00:51:22,760 --> 00:51:25,440
This introduces a couple things.

591
00:51:25,440 --> 00:51:30,520
One, it does introduce a potential source of bias.

592
00:51:30,520 --> 00:51:42,240
If one user is leaving a disproportionate number of reviews, then that could bias things

593
00:51:42,240 --> 00:51:47,720
towards that one reviewer's preferences.

594
00:51:47,720 --> 00:51:53,240
For example, Chill Panda has left all these reviews.

595
00:51:53,240 --> 00:52:03,600
We may just be really, really good at predicting Chill Panda's effects.

596
00:52:03,600 --> 00:52:12,920
And so that's actually okay because that's what statistics is for in that statistics

597
00:52:12,920 --> 00:52:22,480
is all about uncertainty and how do we actually use this variability to our advantage.

598
00:52:22,480 --> 00:52:41,920
And I think this is actually awesome because now we can actually condition effects on the

599
00:52:41,920 --> 00:52:42,920
user.

600
00:52:42,920 --> 00:52:48,800
And so John, you brought this up real briefly at the end of the last meetup, and I'm not

601
00:52:48,800 --> 00:52:52,960
sure if it fully sunk in with me.

602
00:52:52,960 --> 00:53:00,800
But basically, different people may have different effects.

603
00:53:00,800 --> 00:53:05,400
Different people may gravitate towards different strains.

604
00:53:05,400 --> 00:53:11,120
So for example, I just want to go ahead and show you this data real quick.

605
00:53:11,120 --> 00:53:17,720
So basically, just I'm excluding anonymous.

606
00:53:17,720 --> 00:53:24,680
I'm keeping only the observations with these compounds of interest that were detected.

607
00:53:24,680 --> 00:53:31,000
And we're left with a training sample of still around 4,000.

608
00:53:31,000 --> 00:53:44,640
So we started with what, 42,000 reviews, but then we had to remove duplicates.

609
00:53:44,640 --> 00:53:47,920
We had to remove things with non-detects.

610
00:53:47,920 --> 00:53:50,760
So there's a lot of noise in the data.

611
00:53:50,760 --> 00:53:58,520
But once we remove all the extraneous noise and take a random sample, we've got a good

612
00:53:58,520 --> 00:54:00,160
training set.

613
00:54:00,160 --> 00:54:05,680
We've got 4,000 nice random samples here.

614
00:54:05,680 --> 00:54:10,480
And I think we can do some good work with this training set, this sample.

615
00:54:10,480 --> 00:54:19,160
Keegan, this is heavily skewed to the low beta-pinene-limonene group.

616
00:54:19,160 --> 00:54:25,040
This is, actually, I love your take on this, because I wasn't sure what to make of this

617
00:54:25,040 --> 00:54:26,040
plot.

618
00:54:26,040 --> 00:54:27,860
Yeah, this looks pretty skewed to me.

619
00:54:27,860 --> 00:54:36,860
So what I did here was I averaged all the lab results by user.

620
00:54:36,860 --> 00:54:41,660
So this is basically a user's profile.

621
00:54:41,660 --> 00:54:52,080
So this is sort of the average beta-pinene to D-limonene that a consumer chooses.
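The user-profile construction is a group-by average of lab results (toy numbers in place of the real terpene data):

```python
import pandas as pd

# Average the lab results of everything a user reviewed to get a "profile".
data = pd.DataFrame({
    "user": ["panda", "panda", "washington", "washington"],
    "beta_pinene": [0.10, 0.30, 0.50, 0.70],
    "d_limonene": [1.00, 1.20, 0.40, 0.60],
})
profiles = data.groupby("user")[["beta_pinene", "d_limonene"]].mean()
print(profiles)
```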

622
00:54:52,080 --> 00:54:55,960
So they may choose samples all over the board.

623
00:54:55,960 --> 00:55:06,400
But on average, somebody over here, they may have a predilection towards indica-type strains.

624
00:55:06,400 --> 00:55:11,640
And so each of these dots corresponds to a specific user.

625
00:55:11,640 --> 00:55:18,160
There's no mapping between the user and the lab results.

626
00:55:18,160 --> 00:55:22,440
There's a mapping between the strain and the lab results.

627
00:55:22,440 --> 00:55:29,840
But there are several lab results for each strain.

628
00:55:29,840 --> 00:55:39,280
And so if you just average that down, then you're sort of compressing the data.

629
00:55:39,280 --> 00:55:41,360
You're 100% correct.

630
00:55:41,360 --> 00:55:50,200
And so basically, I'm using the average lab results as a proxy for the actual lab results

631
00:55:50,200 --> 00:55:52,360
of this person's product.

632
00:55:52,360 --> 00:55:59,080
So that's why I was saying, if you actually have the lab results of the product the person

633
00:55:59,080 --> 00:56:03,400
consumed, that is the match you want.

634
00:56:03,400 --> 00:56:09,680
So I'm just kind of playing pretend that we have that match.

635
00:56:09,680 --> 00:56:15,760
So you see, we're kind of taking a leap of faith.

636
00:56:15,760 --> 00:56:21,120
It's not perfect, but I don't want to say I see other economists do this.

637
00:56:21,120 --> 00:56:31,240
But it's basically the variant of interest is the compounds in the cannabis strain that

638
00:56:31,240 --> 00:56:34,520
someone consumed and reviewed.

639
00:56:34,520 --> 00:56:35,840
That's missing.

640
00:56:35,840 --> 00:56:38,080
So we have to estimate that.

641
00:56:38,080 --> 00:56:41,880
And so we're estimating it with the average lab results.

642
00:56:41,880 --> 00:56:45,660
So it's way, way far from perfect.

643
00:56:45,660 --> 00:57:06,120
But the idea is, oh, if you look at the strains that Green Green Washington's picking, are

644
00:57:06,120 --> 00:57:11,720
these Exodus Cheese, Marion, Mary Cush?

645
00:57:11,720 --> 00:57:21,400
On average, are these more of this high beta-pinene to D-limonene type, slash sativa?

646
00:57:21,400 --> 00:57:26,400
Or are they on average more indica type?

647
00:57:26,400 --> 00:57:33,920
So I think you're...

648
00:57:33,920 --> 00:57:39,920
By compressing this, you're losing a lot of information and you're repeating the same

649
00:57:39,920 --> 00:57:43,200
information over and over again.

650
00:57:43,200 --> 00:57:50,960
So there's 4,000 some odd reviews and there's around 100...

651
00:57:50,960 --> 00:57:58,720
After you compress the data and remove all the duplicates, there's maybe like 100 different

652
00:57:58,720 --> 00:57:59,720
strains.

653
00:57:59,720 --> 00:58:00,720
184, three.

654
00:58:00,720 --> 00:58:01,720
183.

655
00:58:01,720 --> 00:58:02,720
Yeah.

656
00:58:02,720 --> 00:58:11,000
So it's... Charles, can I...

657
00:58:11,000 --> 00:58:13,240
Can we make a date to look offline?

658
00:58:13,240 --> 00:58:19,720
Because I want to show you, I think that that 183 set is extremely well behaved.

659
00:58:19,720 --> 00:58:26,800
I can try and convince you of that through principal component analysis and loading plots

660
00:58:26,800 --> 00:58:29,400
that we've done.

661
00:58:29,400 --> 00:58:37,160
So I am of the opinion that that 183 mean data set has value.

662
00:58:37,160 --> 00:58:38,160
Okay.

663
00:58:38,160 --> 00:58:39,160
Yeah.

664
00:58:39,160 --> 00:58:42,160
I'd like to see it.

665
00:58:42,160 --> 00:58:48,440
I need to walk you through the principal component analysis results and let's make a date for

666
00:58:48,440 --> 00:58:49,440
that.

667
00:58:49,440 --> 00:58:54,240
I'll see if I can convince you because there's a lot of story in it that I can describe to

668
00:58:54,240 --> 00:58:55,240
you.

669
00:58:55,240 --> 00:58:56,240
Okay.

670
00:58:56,240 --> 00:58:57,240
That would be great.

671
00:58:57,240 --> 00:58:58,240
I really like that.

672
00:58:58,240 --> 00:58:59,240
Yeah.

673
00:58:59,240 --> 00:59:04,320
Again, I come at this from the biochemistry.

674
00:59:04,320 --> 00:59:06,600
So just to...

675
00:59:06,600 --> 00:59:13,160
So basically, you're 100% right, Charles, but basically I don't want to let this stop

676
00:59:13,160 --> 00:59:22,240
us from fitting this model because the model can be used by any single company that's got

677
00:59:22,240 --> 00:59:32,080
reviews and lab results, and thousands of companies, or maybe not thousands, but all the big apps

678
00:59:32,080 --> 00:59:37,200
have reviews and lab results, one-to-one matches.

679
00:59:37,200 --> 00:59:46,260
So we don't have that data, but we can make this model and then Weedmaps, Leafly, any

680
00:59:46,260 --> 00:59:51,080
of these big companies could use this algorithm.

681
00:59:51,080 --> 01:00:01,240
And I argue just because we're missing the data, it's still a pretty good proxy because

682
01:00:01,240 --> 01:00:19,960
basically what I'm saying here is this person, they sampled DJ Short Blueberry three times.

683
01:00:19,960 --> 01:00:24,360
One of the three times it made them sleepy.

684
01:00:24,360 --> 01:00:36,520
And so it would be nice to know the total THC of each DJ Short Blueberry they sampled.

685
01:00:36,520 --> 01:00:40,400
But then, we've all heard about this noise.

686
01:00:40,400 --> 01:00:44,280
Go talk to any cultivator and they'll...

687
01:00:44,280 --> 01:00:51,000
Not any, but most cultivators will just complain to you all day long about how their lab results

688
01:00:51,000 --> 01:00:55,280
aren't representative of their product.

689
01:00:55,280 --> 01:00:59,840
And they say that's a problem because they say, oh, we send in one sample and we get

690
01:00:59,840 --> 01:01:04,120
one lab result back, and then that's what goes on the label.

691
01:01:04,120 --> 01:01:12,360
So one could even make the argument that the average is even a better metric than what's

692
01:01:12,360 --> 01:01:14,200
on the label.

693
01:01:14,200 --> 01:01:19,120
I don't think I would go that far.

694
01:01:19,120 --> 01:01:29,160
But the long story short is we're basically saying this strain has on average 19.5% THC.

695
01:01:29,160 --> 01:01:34,880
It may make you sleepy a third of the time.

696
01:01:34,880 --> 01:01:40,560
We'll have to let the data pan that out.

697
01:01:40,560 --> 01:01:44,360
And then it kind of goes on with the other variables.

698
01:01:44,360 --> 01:01:56,340
So the fact that the lab results repeat is okay as long as you don't have perfect collinearity.

699
01:01:56,340 --> 01:02:05,800
So as long as people are experiencing different effects.

700
01:02:05,800 --> 01:02:11,400
I'm just wondering if it would help if instead of doing natural language processing, when

701
01:02:11,400 --> 01:02:18,280
we gather the data, we use like drop down menus, like radio bullets.

702
01:02:18,280 --> 01:02:21,920
What effect did you experience from this product?

703
01:02:21,920 --> 01:02:27,880
Jerry, that's exactly what we're doing in our dosing project work.

704
01:02:27,880 --> 01:02:34,000
It will solve a lot of this because it will be a given user matched to a given strain

705
01:02:34,000 --> 01:02:36,800
with a given COA and a given effect.

706
01:02:36,800 --> 01:02:41,840
So what we're doing now is kind of modeling off of what Keegan is presenting.

707
01:02:41,840 --> 01:02:47,840
But we're going to have access, God willing, to a much better data set.

708
01:02:47,840 --> 01:02:52,200
That's precisely the approach we're taking.

709
01:02:52,200 --> 01:02:55,640
I think I may have ruffled a few feathers, but that's good because...

710
01:02:55,640 --> 01:02:58,000
No, you're leading the way.

711
01:02:58,000 --> 01:03:01,920
I mean, it's just a lot of twists and turns.

712
01:03:01,920 --> 01:03:02,920
Yeah.

713
01:03:02,920 --> 01:03:07,240
Well, like I said, we're estimating a utility function here.

714
01:03:07,240 --> 01:03:10,920
So it's not just a trivial thing to do.

715
01:03:10,920 --> 01:03:13,400
But here, I'll let you get out of here.

716
01:03:13,400 --> 01:03:20,400
But real quick, I'll just show you how the model would go about working if this was something

717
01:03:20,400 --> 01:03:23,040
that you so desired.

718
01:03:23,040 --> 01:03:34,600
So long story short is you get someone to review here.

719
01:03:34,600 --> 01:03:42,080
So I biked to a dispensary, picked up a lot of weed, da, da, da, da.

720
01:03:42,080 --> 01:03:49,240
And so you'd normally have to pay somebody to read through this and figure out how happy

721
01:03:49,240 --> 01:03:51,640
somebody was from this.

722
01:03:51,640 --> 01:04:00,880
And we can basically just let this natural language toolkit basically give us a ranking

723
01:04:00,880 --> 01:04:10,960
from minus one to one on how positive somebody's experience was.
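A minimal sketch of that ranking step. The session uses the natural language toolkit (NLTK), whose VADER analyzer returns a compound score in [-1, 1]; here a toy word-count lexicon stands in so the example is self-contained, and the word lists are made up for illustration.

```python
# Toy sentiment ranker: maps a review to a score in [-1, 1].
# A stand-in for NLTK's VADER compound score; the word lists are illustrative.
POSITIVE = {"happy", "relaxed", "great", "euphoric", "uplifted"}
NEGATIVE = {"anxious", "paranoid", "bad", "dizzy", "nauseous"}

def sentiment_score(review: str) -> float:
    """Return (pos - neg) / (pos + neg), or 0.0 for neutral text."""
    words = review.lower().split()
    pos = sum(w.strip(".,!") in POSITIVE for w in words)
    neg = sum(w.strip(".,!") in NEGATIVE for w in words)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

print(sentiment_score("Felt happy and relaxed, a great evening"))  # 1.0
print(sentiment_score("Made me anxious and a bit paranoid"))       # -1.0
```

The point is only that each free-text review collapses to one number on a continuum, so no human has to read and grade it.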

724
01:04:10,960 --> 01:04:22,520
So this person, it looks like, had quite a positive experience.

725
01:04:22,520 --> 01:04:30,480
And this person may have had less of a positive experience.

726
01:04:30,480 --> 01:04:34,760
So I think this part itself could probably be tailored in.

727
01:04:34,760 --> 01:04:39,160
Like I said, I'm just kind of picking up this tool and running with it.

728
01:04:39,160 --> 01:04:53,520
But the idea is you could basically rank all of these reviews from minus one to one.

729
01:04:53,520 --> 01:04:58,320
So that way, we can put it in a continuous space.

730
01:04:58,320 --> 01:05:02,040
And we're not having to deal with zeros and ones.

731
01:05:02,040 --> 01:05:04,920
We have a nice continuum.

732
01:05:04,920 --> 01:05:10,300
Instead of just saying, oh, I was happy or I was sleepy, we actually can kind of get

733
01:05:10,300 --> 01:05:14,680
an idea of how happy you were.

734
01:05:14,680 --> 01:05:22,080
And notice it's heavily biased towards positive effects.

735
01:05:22,080 --> 01:05:27,100
And so this is kind of something I talked about where if you're going to be leaving

736
01:05:27,100 --> 01:05:33,480
a review, it's probably because you had a positive experience.

737
01:05:33,480 --> 01:05:35,600
Or maybe you had a really negative experience.

738
01:05:35,600 --> 01:05:44,760
But chances are, if you're going to Leafly and leaving a review, you probably have a

739
01:05:44,760 --> 01:05:47,920
positive opinion on this.

740
01:05:47,920 --> 01:05:51,400
So long story short, make of it what you will.

741
01:05:51,400 --> 01:06:01,760
But you can then regress this ranking on the log of all these compounds.
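That regression step, sketched with NumPy on synthetic data (real inputs would be the sentiment scores and the lab-result percentages; every number below is made up so the recovery of the coefficients is visible).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
thc = rng.uniform(10, 30, n)    # total THC, percent (synthetic)
cbd = rng.uniform(0.1, 2.0, n)  # total CBD, percent (synthetic)

# Synthetic sentiment scores generated from a known rule plus noise,
# so the regression below can be checked against the truth.
score = (0.2 + 0.15 * np.log(thc) + 0.05 * np.log(cbd)
         + rng.normal(0, 0.05, n))

# Regress the sentiment ranking on a constant and the logs of the compounds.
X = np.column_stack([np.ones(n), np.log(thc), np.log(cbd)])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
print(beta)  # roughly [0.2, 0.15, 0.05]
```

With real reviews the fit will be much noisier, which is exactly the low-R-squared caveat raised a moment later.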

742
01:06:01,760 --> 01:06:05,720
So I know the data is not perfect.

743
01:06:05,720 --> 01:06:16,200
But if you had your own data, your R squared may be much higher than our meager, measly

744
01:06:16,200 --> 01:06:17,440
R squared.

745
01:06:17,440 --> 01:06:26,760
So yes, our data is, as Charles put it, kind of compiled from hell.

746
01:06:26,760 --> 01:06:34,120
Charles didn't say it like that, but I said it like that.

747
01:06:34,120 --> 01:06:35,840
That's my training as an economist.

748
01:06:35,840 --> 01:06:38,000
So judge me how you will.

749
01:06:38,000 --> 01:06:44,360
But long story short is make of it what you will.

750
01:06:44,360 --> 01:06:53,960
But the way I interpret this is, okay, someone may have a baseline happiness of around 0.45

751
01:06:53,960 --> 01:06:56,800
on the minus one to one scale.

752
01:06:56,800 --> 01:07:07,480
And then what this would basically say would be, for each 1% increase in total THC, you

753
01:07:07,480 --> 01:07:18,080
would move 0.06 toward one.

754
01:07:18,080 --> 01:07:27,280
So the idea is an increase of total THC would increase your utility.
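The arithmetic of that reading, using the coefficients quoted above (baseline 0.45, 0.06 per percentage point of THC); the numbers are the illustrative estimates from the session, and the cap at 1 is my addition since the sentiment scale tops out there.

```python
def predicted_sentiment(total_thc: float,
                        baseline: float = 0.45,
                        thc_coef: float = 0.06) -> float:
    """Predicted review sentiment on the [-1, 1] scale, capped at 1.
    Baseline and coefficient are the illustrative values quoted above."""
    return min(1.0, baseline + thc_coef * total_thc)

print(predicted_sentiment(5.0))   # baseline plus five steps of 0.06
print(predicted_sentiment(20.0))  # capped at 1.0
```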

755
01:07:27,280 --> 01:07:31,080
I'm not even going to read into the statistical significance of this.

756
01:07:31,080 --> 01:07:38,080
It looks like CBD may, but once again, the coefficient is really small.

757
01:07:38,080 --> 01:07:42,520
So CBD may increase your happiness.

758
01:07:42,520 --> 01:07:49,640
And then basically what I'm seeing is, once again, the coefficients on these are really

759
01:07:49,640 --> 01:07:50,640
low.

760
01:07:50,640 --> 01:08:02,200
But it could be that increasing your d-limonene, beta-caryophyllene, and humulene may actually

761
01:08:02,200 --> 01:08:08,640
decrease people's positive review.

762
01:08:08,640 --> 01:08:20,040
And then just real, real quick, this is the expected utility function just to estimate

763
01:08:20,040 --> 01:08:21,040
everything.

764
01:08:21,040 --> 01:08:26,400
But kind of once again, you're seeing the same thing.

765
01:08:26,400 --> 01:08:34,360
But then in this case, you're generally – if somebody does get sleepy, they may get slightly

766
01:08:34,360 --> 01:08:37,640
less utility from these compounds.

767
01:08:37,640 --> 01:08:39,720
Hey, Keegan?

768
01:08:39,720 --> 01:08:40,720
Yes?

769
01:08:40,720 --> 01:08:48,160
A really critical point that we deal with in our own discussions on this is, if you

770
01:08:48,160 --> 01:08:55,440
pick sleepy as one of your indicators, there's a large group of people that use cannabis

771
01:08:55,440 --> 01:08:57,120
to go to sleep.

772
01:08:57,120 --> 01:09:00,080
And so that is a desired outcome.

773
01:09:00,080 --> 01:09:05,440
You have to really query for indication for what the intent was.

774
01:09:05,440 --> 01:09:10,040
And I think you have to build that into the model a priori that I didn't hear you really

775
01:09:10,040 --> 01:09:15,640
– I don't think that's quite there yet, or maybe I missed it.

776
01:09:15,640 --> 01:09:22,280
But I think you need to put – you have to query and put the intent by respondent into

777
01:09:22,280 --> 01:09:23,280
this model.

778
01:09:23,280 --> 01:09:25,880
And you hit the nail on the head, John.

779
01:09:25,880 --> 01:09:28,200
So thank you for bringing this up.

780
01:09:28,200 --> 01:09:31,760
This is the exact extension of the model that's needed.

781
01:09:31,760 --> 01:09:36,480
So if you're interested, sign up for Saturday Morning Statistics because that's when we'll

782
01:09:36,480 --> 01:09:37,480
be doing it.

783
01:09:37,480 --> 01:09:46,240
But basically, it's a fairly trivial extension of this model to basically let the parameters

784
01:09:46,240 --> 01:09:54,720
alpha vary by – you could let them vary in a couple different ways, but potentially,

785
01:09:54,720 --> 01:10:00,440
we could just let them vary by individual.

786
01:10:00,440 --> 01:10:05,440
You could do types of individual, but if we've got enough data, we could basically let the

787
01:10:05,440 --> 01:10:10,640
parameters vary by – I want to say we could – I don't know if we're going to have

788
01:10:10,640 --> 01:10:11,640
enough data.

789
01:10:11,640 --> 01:10:19,840
We may have to – I was thinking that what you can do is basically you could restrict

790
01:10:19,840 --> 01:10:28,720
your sample to the people that leave a lot of reviews and then let the parameters vary

791
01:10:28,720 --> 01:10:35,360
by user, and then you could basically get user-specific effects.
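A sketch of that restriction-and-refit idea: keep only users with several reviews, then fit a separate intercept and slope for each one. The user names, counts, and numbers are all hypothetical, and a real version would use a proper random- or fixed-effects estimator rather than one regression per user.

```python
import numpy as np
from collections import defaultdict

# Hypothetical rows: (user_id, log total THC, sentiment score in [-1, 1]).
reviews = [
    ("amy", 2.9, 0.80), ("amy", 3.1, 0.90), ("amy", 2.5, 0.50), ("amy", 3.3, 0.95),
    ("bob", 2.9, 0.10), ("bob", 3.1, -0.20), ("bob", 2.5, 0.40), ("bob", 3.3, -0.40),
    ("cat", 2.7, 0.30),  # only one review: dropped by the filter below
]

MIN_REVIEWS = 3  # restrict the sample to frequent reviewers, as discussed
by_user = defaultdict(list)
for user, log_thc, score in reviews:
    by_user[user].append((log_thc, score))

# Fit a separate intercept and slope per frequent reviewer:
# user-specific THC effects (the alphas varying by individual).
user_effects = {}
for user, rows in by_user.items():
    if len(rows) < MIN_REVIEWS:
        continue
    x = np.array([r[0] for r in rows])
    y = np.array([r[1] for r in rows])
    X = np.column_stack([np.ones_like(x), x])
    (a0, a1), *_ = np.linalg.lstsq(X, y, rcond=None)
    user_effects[user] = (a0, a1)

print(user_effects)  # one (baseline, THC slope) pair per frequent reviewer
```

In this toy data one user's THC slope comes out positive and the other's negative, which is exactly the "some users may get a really good effect from THC and some wouldn't" case.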

792
01:10:35,360 --> 01:10:37,960
And so then you'd be exactly right.

793
01:10:37,960 --> 01:10:47,480
And so then some users may get a really good effect from THC and then some wouldn't.

794
01:10:47,480 --> 01:10:52,200
Some users may get a really good effect from CBD and others wouldn't.

795
01:10:52,200 --> 01:10:56,600
And then it could handle all the different states.

796
01:10:56,600 --> 01:11:03,880
So then you could let it – some users really like the sleepy effect, some not like the

797
01:11:03,880 --> 01:11:05,520
sleepy effect.

798
01:11:05,520 --> 01:11:13,280
And so I think it kind of captures everything, right, because it's basically saying it

799
01:11:13,280 --> 01:11:16,240
ties it nicely in with economic theory.

800
01:11:16,240 --> 01:11:21,260
You've got consumers trying to maximize their utility from cannabis.

801
01:11:21,260 --> 01:11:22,760
How do they do that?

802
01:11:22,760 --> 01:11:30,280
Well, it's actually from the compounds in the cannabis, conditional on interacting

803
01:11:30,280 --> 01:11:37,320
with the effects that they experience.

804
01:11:37,320 --> 01:11:43,320
And it varies person by person at a biochemistry level.

805
01:11:43,320 --> 01:11:44,320
So I think –

806
01:11:44,320 --> 01:11:52,720
So it's a parameter that also – the response group type I think is a different parameter

807
01:11:52,720 --> 01:11:54,560
from the intent parameter.

808
01:11:54,560 --> 01:11:58,240
I think you need two of them at least.

809
01:11:58,240 --> 01:12:01,840
Basically alpha is going to get alpha 1i.
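In notation, letting the slope vary by individual turns the pooled model into a per-user one; this is a sketch of the extension being discussed, with the symbols being my own labeling.

```latex
% Pooled model: one slope for everyone
U = \alpha_0 + \alpha_1 \ln(\text{THC}) + \varepsilon

% Per-user extension: \alpha_1 becomes \alpha_{1i} for individual i
U_i = \alpha_{0i} + \alpha_{1i} \ln(\text{THC}) + \varepsilon_i
```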

810
01:12:01,840 --> 01:12:02,840
Okay.

811
01:12:02,840 --> 01:12:03,840
Okay.

812
01:12:03,840 --> 01:12:13,360
So as I said, tune in to Saturday morning if you really want like the hardcore statistics

813
01:12:13,360 --> 01:12:14,360
on this.

814
01:12:14,360 --> 01:12:20,480
But essentially, as I said, I just kind of thought of this last night.

815
01:12:20,480 --> 01:12:25,080
So this still needs a lot of polishing.

816
01:12:25,080 --> 01:12:33,200
But the idea is – I mean if you can estimate expected utility from a cannabis product,

817
01:12:33,200 --> 01:12:37,480
of course the model is going to need a lot more refinement and you can be creative.

818
01:12:37,480 --> 01:12:42,840
And that's why I always say this is your chance to be creative in X: think about

819
01:12:42,840 --> 01:12:50,680
all the different variants that matter, all the different biochemistries that matter,

820
01:12:50,680 --> 01:12:54,800
all the different states of nature that matter.

821
01:12:54,800 --> 01:13:02,160
So you can be creative in X, but the idea is this is a fairly general model.

822
01:13:02,160 --> 01:13:07,680
And as I said, this could help anyone who's got a recommendation engine.

823
01:13:07,680 --> 01:13:13,400
It could help people thinking about what they're going to grow.

824
01:13:13,400 --> 01:13:19,440
And it could also just help scientists and consumers just kind of understand how does

825
01:13:19,440 --> 01:13:21,840
the brain work, right?

826
01:13:21,840 --> 01:13:24,600
Because how do people make decisions?

827
01:13:24,600 --> 01:13:27,780
How do compounds affect people?

828
01:13:27,780 --> 01:13:31,280
I think the jury's still out on a lot of this.

829
01:13:31,280 --> 01:13:37,320
And I think this model really is a big step in the right direction.

830
01:13:37,320 --> 01:13:45,520
So thank you all for your tough criticism or your tough remarks because no model's

831
01:13:45,520 --> 01:13:46,520
perfect.

832
01:13:46,520 --> 01:13:49,040
This model is far from perfect.

833
01:13:49,040 --> 01:13:52,080
The data is way far from perfect.

834
01:13:52,080 --> 01:13:56,760
So there's a lot of improvement all across the board that needs to be done.

835
01:13:56,760 --> 01:14:02,960
That's why I call upon you to help contribute because I think we're onto something big

836
01:14:02,960 --> 01:14:03,960
here.

837
01:14:03,960 --> 01:14:11,360
It's just as Charles and John and everyone had pointed out, there's so many intricacies

838
01:14:11,360 --> 01:14:17,920
and ways that you can misstep and ways that you should be doing this correctly, like a proper

839
01:14:17,920 --> 01:14:18,960
data scientist.

840
01:14:18,960 --> 01:14:28,480
So I'm calling on all of you for your help because I think we could be onto essentially

841
01:14:28,480 --> 01:14:36,920
the best model of cannabis consumption that I've seen.

842
01:14:36,920 --> 01:14:41,360
So if any of you want to help contribute, then message me.

843
01:14:41,360 --> 01:14:42,520
Join the Slack channel.

844
01:14:42,520 --> 01:14:43,520
It's all hands on deck.

845
01:14:43,520 --> 01:14:45,960
But I'll leave that with you.

846
01:14:45,960 --> 01:14:52,640
And my two insights from this week were data is really just hiding everywhere.

847
01:14:52,640 --> 01:14:54,760
You just have to look for it.

848
01:14:54,760 --> 01:14:56,800
And then this synthetic data.

849
01:14:56,800 --> 01:14:59,080
You've got to be a little careful with it.

850
01:14:59,080 --> 01:15:03,800
But personally, I think it can be quite useful.

851
01:15:03,800 --> 01:15:08,240
So I know it went way, way, way, way, way over time.

852
01:15:08,240 --> 01:15:14,120
But I thought this material was hopefully worth your while.

853
01:15:14,120 --> 01:15:19,680
So any last thoughts, comments, questions before we finally let you get out of here

854
01:15:19,680 --> 01:15:24,320
and enjoy your day?

855
01:15:24,320 --> 01:15:32,280
I think defining those alphas is going to be an important exercise.

856
01:15:32,280 --> 01:15:34,560
We certainly talk about timing.

857
01:15:34,560 --> 01:15:36,320
We talk about dose.

858
01:15:36,320 --> 01:15:42,240
We talk about content as the three main parameters that we're kind of focused on.

859
01:15:42,240 --> 01:15:47,640
And what primarily your discussion today is, is on content.

860
01:15:47,640 --> 01:15:52,200
And we need to be able to get to the questions of timing and dose.

861
01:15:52,200 --> 01:15:59,600
So this data set may not have it for us, but others might, or those that we generate.

862
01:15:59,600 --> 01:16:00,600
Exactly.

863
01:16:00,600 --> 01:16:06,360
Basically, we'll be able to maybe assign random effects for the users.

864
01:16:06,360 --> 01:16:19,920
So as I said, it's not going to be too, too useful in interpreting biochemistry type effects.

865
01:16:19,920 --> 01:16:24,120
The framework's at least in place.

866
01:16:24,120 --> 01:16:31,840
So like you, if you get that data, then you're welcome to use this framework and actually

867
01:16:31,840 --> 01:16:34,840
be able to make those inferences.

868
01:16:34,840 --> 01:16:40,720
Well one thing I'm going to try, if we can connect and I can get your user sorted data,

869
01:16:40,720 --> 01:16:46,400
at least the way you presented it today, I think I can query to see if we can identify

870
01:16:46,400 --> 01:16:49,640
the response type of the reviewer.

871
01:16:49,640 --> 01:16:53,800
That's something that came out of, I'm thinking, watching what you did today.

872
01:16:53,800 --> 01:17:00,240
And I'll give that a try if I can get that data, if I can get that data set.

873
01:17:00,240 --> 01:17:01,640
Definitely.

874
01:17:01,640 --> 01:17:06,000
And let's all stay in touch and keep moving the ball forward.

875
01:17:06,000 --> 01:17:10,520
Because I hope, I think that we're advancing cannabis science.

876
01:17:10,520 --> 01:17:17,640
So let's keep our noses to the grindstone, have fun while we're doing it, and get out

877
01:17:17,640 --> 01:17:18,640
of here and enjoy your time.

878
01:17:18,640 --> 01:17:19,640
Get the right strain.

879
01:17:19,640 --> 01:17:49,560
Get the right strain, everyone.

