1
00:00:00,000 --> 00:00:08,360
Welcome to the Cannabis Data Science Meetup.

2
00:00:08,360 --> 00:00:15,600
As always, the team of data scientists and brilliant minds from around the world coming

3
00:00:15,600 --> 00:00:19,080
here to advance cannabis science.

4
00:00:19,080 --> 00:00:25,700
Even if it's only one molecule at a time, bit by bit, we're moving forward and got some

5
00:00:25,700 --> 00:00:28,720
cool statistics to share with you today.

6
00:00:28,720 --> 00:00:36,920
We'd be kind of curious to just talk cannabis data with you real quick if you are interested.

7
00:00:36,920 --> 00:00:44,960
So Ruth, I saw that you were working on the cannabis license data and that's kind of pertinent

8
00:00:44,960 --> 00:00:54,520
to just the general discussion of cannabis data, since it's a bit of a mysterious industry

9
00:00:54,520 --> 00:00:59,120
after all and it's tough to come by quality data.

10
00:00:59,120 --> 00:01:03,680
So we'd love to get some of your thoughts that you may want to put on the table before

11
00:01:03,680 --> 00:01:05,880
I just carry on too much.

12
00:01:05,880 --> 00:01:09,360
That way we can have a nice back and forth.

13
00:01:09,360 --> 00:01:14,040
I had done an analysis a couple years ago looking at the license data and I was really

14
00:01:14,040 --> 00:01:20,560
interested in understanding what the type of structures of ownership of the different

15
00:01:20,560 --> 00:01:22,440
licenses were across states.

16
00:01:22,440 --> 00:01:29,560
So in other words, what amount of horizontal and vertical integration is there?

17
00:01:29,560 --> 00:01:34,400
Now certain states have restrictions so some require vertical integration and some actually

18
00:01:34,400 --> 00:01:35,960
prohibit it.

19
00:01:35,960 --> 00:01:43,160
But I was just interested in, I was thinking kind of from an economic standpoint.

20
00:01:43,160 --> 00:01:47,800
If you think that technology can flow across state lines and know-how can flow across state

21
00:01:47,800 --> 00:01:59,200
lines and the economics of non-cannabis inputs should be relatively similar across state lines

22
00:01:59,200 --> 00:02:06,560
and in all of those cases you would expect the organization of the industry, this is

23
00:02:06,560 --> 00:02:12,800
kind of my forte, you know, do companies choose to remain independent or do they choose to

24
00:02:12,800 --> 00:02:13,800
integrate?

25
00:02:13,800 --> 00:02:20,320
Do you see people doing only cultivation, are they doing only dispensaries or are they

26
00:02:20,320 --> 00:02:25,000
doing both cultivation and dispensaries and then perhaps manufacturing?

27
00:02:25,000 --> 00:02:30,480
And again to the extent that most resources in cannabis can flow across state lines you

28
00:02:30,480 --> 00:02:36,720
would expect to see the same types of structures across states.

29
00:02:36,720 --> 00:02:41,700
And if we don't see the same types of structures across states then obviously the regulations

30
00:02:41,700 --> 00:02:43,760
are playing a major role.

31
00:02:43,760 --> 00:02:51,000
And I did this analysis and I looked at California, Oregon, Michigan and Nevada because that's

32
00:02:51,000 --> 00:02:55,040
where I was able to get license data and as I said this is a couple years ago and there

33
00:02:55,040 --> 00:03:01,240
turned out to be big differences across states which I thought was really interesting.

34
00:03:01,240 --> 00:03:06,200
I didn't publish the results but now because Keegan's made this, you know, the license

35
00:03:06,200 --> 00:03:11,080
data so available I've been going through and trying to understand kind of which licenses

36
00:03:11,080 --> 00:03:12,660
are available.

37
00:03:12,660 --> 00:03:18,560
Another interesting thing is some states for example enable distribution or wholesaling

38
00:03:18,560 --> 00:03:21,480
while other states don't.

39
00:03:21,480 --> 00:03:30,200
And so again you're seeing this huge regulatory shaping of the industries so that you're

40
00:03:30,200 --> 00:03:35,600
going to have very different outcomes across states just by virtue of what the regulators

41
00:03:35,600 --> 00:03:39,720
are requiring or restricting or what.

42
00:03:39,720 --> 00:03:44,480
And so anyway long-winded I had done an early analysis on four different states because

43
00:03:44,480 --> 00:03:50,240
that's what I had available but Keegan is making available licensing data from a lot

44
00:03:50,240 --> 00:03:51,240
of different states.

45
00:03:51,240 --> 00:03:55,560
I've been going through supplementing them where he didn't have all the information and

46
00:03:55,560 --> 00:04:00,040
doing it on a lot more states and I'm kind of in the middle of that analysis and I'm

47
00:04:00,040 --> 00:04:01,600
really excited to be able to do it.

48
00:04:01,600 --> 00:04:03,480
So thank you Keegan.

49
00:04:03,480 --> 00:04:15,560
I love it Ruth and hopefully we can put some more statistics in your tool belt today.

50
00:04:15,560 --> 00:04:20,080
As always as I mentioned at the beginning and we'll welcome the newcomers here in a

51
00:04:20,080 --> 00:04:26,640
second, we just go bit by bit, tiniest little statistic molecule that we can.

52
00:04:26,640 --> 00:04:33,400
And please continue looking at the data because I'm sure you'll have more insights than

53
00:04:33,400 --> 00:04:41,600
I could think of. But one thing that I discovered today is a myriad of statistics

54
00:04:41,600 --> 00:04:48,280
from the ecology literature and that's mainly what I wanted to share with you today and

55
00:04:48,280 --> 00:04:54,400
basically I was going to share with you some statistics about diversity in particular you

56
00:04:54,400 --> 00:04:57,760
know diversity of cannabis products.

57
00:04:57,760 --> 00:05:04,440
But what you could potentially look at is so you're looking at market structure.

58
00:05:04,440 --> 00:05:08,720
For next week I was going to start looking at similarity and that's kind of what you're

59
00:05:08,720 --> 00:05:16,320
looking at and so you could kind of compare similarity across state and I'm sure there's

60
00:05:16,320 --> 00:05:21,500
many ways you can do this but what comes to my mind since I've been studying all this

61
00:05:21,500 --> 00:05:30,400
ecological literature is that you could look at which companies, which licenses are similar

62
00:05:30,400 --> 00:05:33,160
across states.

63
00:05:33,160 --> 00:05:43,360
So if you basically consider each licensee like a species, some will like spread out

64
00:05:43,360 --> 00:05:48,040
and those will be your MSOs, your multi-state operators.

65
00:05:48,040 --> 00:05:55,280
So that's one way you could kind of you know study the licensees but I think you're mostly

66
00:05:55,280 --> 00:05:58,560
looking at you know regulatory effects.

67
00:05:58,560 --> 00:06:04,240
So I'll chew on that for next week and then maybe if you want you're welcome to share

68
00:06:04,240 --> 00:06:09,280
any progress you've had or definitely bring your thoughts.

69
00:06:09,280 --> 00:06:15,600
So that's a good topic that we can chew on this week and cover at the next meetup.

70
00:06:15,600 --> 00:06:18,600
I like that Ruth.

71
00:06:18,600 --> 00:06:21,960
So please be chewing on that.

72
00:06:21,960 --> 00:06:28,360
Real quick I'll let Candice say a quick word and then I'll get to some of the newcomers.

73
00:06:28,360 --> 00:06:35,880
Well we've got some returning people but I want to welcome the newcomer here in a second.

74
00:06:35,880 --> 00:06:41,920
So real quick Candice you have a word that you want to say, anything that you want to

75
00:06:41,920 --> 00:06:51,240
put on the table for today? Basically I'm thinking product diversity, Ruth's thinking structural

76
00:06:51,240 --> 00:06:55,920
similarities between states, license structures.

77
00:06:55,920 --> 00:06:56,920
So what's on your mind?

78
00:06:56,920 --> 00:06:59,040
What do you want to put on the table?

79
00:06:59,040 --> 00:07:00,560
I don't have anything to put on the table.

80
00:07:00,560 --> 00:07:07,820
I'm very interested in both your and Ruth's ideas and also too I want to say congratulations

81
00:07:07,820 --> 00:07:08,820
to Ruth.

82
00:07:08,820 --> 00:07:13,040
I saw that your article was published on LinkedIn.

83
00:07:13,040 --> 00:07:15,840
That's fabulous girl and that's it.

84
00:07:15,840 --> 00:07:17,360
That's all I have.

85
00:07:17,360 --> 00:07:18,360
Thank you.

86
00:07:18,360 --> 00:07:19,360
It's great to be here.

87
00:07:19,360 --> 00:07:21,400
Thank you very much.

88
00:07:21,400 --> 00:07:24,080
By all means feel free to put it in the chat and share.

89
00:07:24,080 --> 00:07:25,560
That's exciting news.

90
00:07:25,560 --> 00:07:27,920
Good, good.

91
00:07:27,920 --> 00:07:32,960
Well our newcomer I would love to get your name.

92
00:07:32,960 --> 00:07:40,800
I unfortunately can't read, it appears to be Chinese.

93
00:07:40,800 --> 00:07:44,480
So your name just is in Chinese characters to me.

94
00:07:44,480 --> 00:07:48,360
So if you want to introduce yourself you're welcome to.

95
00:07:48,360 --> 00:07:52,040
Oh hi nice to meet you.

96
00:07:52,040 --> 00:07:55,000
My name is Yuta and I'm a Japanese.

97
00:07:55,000 --> 00:08:00,200
I'm sorry so I couldn't turn on the video because I just woke up right now and I just

98
00:08:00,200 --> 00:08:09,360
out and so right now I'm in Edmonton in Canada and I'm looking for opportunity for the industries

99
00:08:09,360 --> 00:08:15,600
and also right now I'm volunteering the production cannabis company.

100
00:08:15,600 --> 00:08:19,640
So I just really want to know about the industries and people here.

101
00:08:19,640 --> 00:08:21,720
So that's why I just attend here.

102
00:08:21,720 --> 00:08:26,080
So thank you so much for the opportunity.

103
00:08:26,080 --> 00:08:27,080
Happy to have you.

104
00:08:27,080 --> 00:08:31,960
And I believe it was Yuta, and so just correct me if that needs to be corrected.

105
00:08:31,960 --> 00:08:34,360
So happy to have you here.

106
00:08:34,360 --> 00:08:40,840
And Canada is definitely a frontier that we want to start studying.

107
00:08:40,840 --> 00:08:45,560
We've kind of neglected it up to this point and we keep saying that oh yes let's study

108
00:08:45,560 --> 00:08:46,560
Canada.

109
00:08:46,560 --> 00:08:51,960
Canada has had legalized adult use for quite a while now.

110
00:08:51,960 --> 00:08:56,440
I actually was seeing an article I want to say it's been five years or so.

111
00:08:56,440 --> 00:08:58,600
So we want to take a look.

112
00:08:58,600 --> 00:09:05,320
So I love that you put your thoughts on the table because now we can know that hey there's

113
00:09:05,320 --> 00:09:09,560
some demand for cannabis data science in Canada.

114
00:09:09,560 --> 00:09:17,600
Let's go fulfill that demand and maybe you're here to offer some supply too.

115
00:09:17,600 --> 00:09:20,120
So good to have you here, Yuta.

116
00:09:20,120 --> 00:09:21,120
Thank you.

117
00:09:21,120 --> 00:09:29,120
So, Larissa, you want to chime in anything you want to put on the table for today?

118
00:09:29,120 --> 00:09:33,040
No pressure.

119
00:09:33,040 --> 00:09:45,280
And then Caleb, coincidentally you attended on a pristine day because you've kept being

120
00:09:45,280 --> 00:09:52,600
in my ear about the importance of strain cultivar diversity.

121
00:09:52,600 --> 00:09:55,240
And that's exactly what we're going to study today.

122
00:09:55,240 --> 00:10:04,800
Going to take a look at the diversity of strains in Washington and then going to piggyback

123
00:10:04,800 --> 00:10:13,520
on that with something that another Meetup member, Lou, who was in my ear about product

124
00:10:13,520 --> 00:10:15,760
diversity in Connecticut.

125
00:10:15,760 --> 00:10:18,000
So we'll look at both of those today.

126
00:10:18,000 --> 00:10:21,880
You've got some cool statistics to put on your plate.

127
00:10:21,880 --> 00:10:23,920
But anything that you want to put on the table?

128
00:10:23,920 --> 00:10:26,320
Happy to have you here, Caleb.

129
00:10:26,320 --> 00:10:37,780
Yeah, excited to be here and just always interested to see the data and pull some pathways of understanding

130
00:10:37,780 --> 00:10:39,320
out of it.

131
00:10:39,320 --> 00:10:47,440
We're working on structuring the fundamentals for in-field data collection for open collaboration

132
00:10:47,440 --> 00:10:49,240
framework that we've been working on.

133
00:10:49,240 --> 00:10:54,880
And so it'll be really interesting to see what other people in the community really

134
00:10:54,880 --> 00:11:00,800
value in terms of the diversity and different phenotypic traits and other things like that.

135
00:11:00,800 --> 00:11:04,800
So I imagine we'll be looking more at the processed flower diversity.

136
00:11:04,800 --> 00:11:10,560
But yeah, I'm just interested in capturing the diversity of the cannabis plant in general,

137
00:11:10,560 --> 00:11:14,720
first in my mind and then eventually into actual quantification systems.

138
00:11:14,720 --> 00:11:22,080
Well, you're in for a treat today, a Dutch treat.

139
00:11:22,080 --> 00:11:25,080
So we've got it.

140
00:11:25,080 --> 00:11:26,080
I mean, just wait.

141
00:11:26,080 --> 00:11:32,300
We've got it, like I said, probably a half a dozen or more really cool visualizations

142
00:11:32,300 --> 00:11:37,280
and statistics to calculate.

143
00:11:37,280 --> 00:11:39,560
And yes, you're going to see a bunch of top strains.

144
00:11:39,560 --> 00:11:47,560
And also, we're going to actually, well, end my analysis with a lesson of why the work

145
00:11:47,560 --> 00:11:50,360
you're doing is so important.

146
00:11:50,360 --> 00:11:51,360
So stay tuned.

147
00:11:51,360 --> 00:11:52,360
That's coming.

148
00:11:52,360 --> 00:11:55,560
So big things coming up soon.

149
00:11:55,560 --> 00:11:59,840
Yasha, you're always doing groundbreaking work.

150
00:11:59,840 --> 00:12:05,440
We've definitely been moving everything forward in Massachusetts.

151
00:12:05,440 --> 00:12:09,000
We'd love, and I mean nationwide for that matter.

152
00:12:09,000 --> 00:12:13,960
So we'd love to hear about anything that you want to put on the table.

153
00:12:13,960 --> 00:12:16,000
Can you hear me?

154
00:12:16,000 --> 00:12:17,000
Yes.

155
00:12:17,000 --> 00:12:22,880
I was having IT problems this morning and I missed your introduction to what you're

156
00:12:22,880 --> 00:12:24,360
going to be talking about today.

157
00:12:24,360 --> 00:12:29,560
So I'm like an excited child that walked into the middle of a conversation and I'm excited

158
00:12:29,560 --> 00:12:34,640
for what you're going to share and the last few words that you said about the diversity

159
00:12:34,640 --> 00:12:35,640
of strains.

160
00:12:35,640 --> 00:12:38,080
Excited to see what you have there.

161
00:12:38,080 --> 00:12:50,320
On my front, I have spoken to I think now more than half of the states that have legalized

162
00:12:50,320 --> 00:12:55,180
cannabis have spoken to I think more than half of the regulators that represent those

163
00:12:55,180 --> 00:13:03,000
states over the last two weeks presented what's happening nationwide.

164
00:13:03,000 --> 00:13:04,560
Excellent discussions.

165
00:13:04,560 --> 00:13:09,400
Nowhere are we getting nearly as deep as the work that you do, Keegan.

166
00:13:09,400 --> 00:13:13,000
It's all very surface on data integrity.

167
00:13:13,000 --> 00:13:18,600
That's pretty much it.

168
00:13:18,600 --> 00:13:23,920
I feel like I'm only scraping the tip of the iceberg.

169
00:13:23,920 --> 00:13:31,200
You're really hammering home some of the really tough to reach statistics in the laboratory

170
00:13:31,200 --> 00:13:36,520
space and I love that you're also spreading the word because as a data scientist, you

171
00:13:36,520 --> 00:13:43,960
have to wear so many hats and sure you may wear the researcher hat, the computer programmer

172
00:13:43,960 --> 00:13:50,160
hat, but you also have to wear the communication expert hat.

173
00:13:50,160 --> 00:13:56,960
You actually have to take these visualizations, statistics, put them in front of people and

174
00:13:56,960 --> 00:14:01,800
help them understand them.

175
00:14:01,800 --> 00:14:10,720
Candice has those old Edward Tufte books too, but one thing Edward Tufte says is basically

176
00:14:10,720 --> 00:14:15,000
your visualization should be able to be read.

177
00:14:15,000 --> 00:14:25,840
If somebody can't understand your visualization, it may be because it's not a good visualization.

178
00:14:25,840 --> 00:14:31,120
It's not telling the story of the data correctly.

179
00:14:31,120 --> 00:14:38,200
One, hopefully the visualization should be readily apparent to people, but two, they're

180
00:14:38,200 --> 00:14:42,040
going to have follow-up questions and that's when things really get fun.

181
00:14:42,040 --> 00:14:49,560
It's when people start to pick apart your research methods.

182
00:14:49,560 --> 00:14:57,200
Just to add a few very quick points, one is I know that within the UX industry, UX profession,

183
00:14:57,200 --> 00:15:02,880
in the 90s the approach was if you have a software and users are not able to use it,

184
00:15:02,880 --> 00:15:03,880
it's the user's fault.

185
00:15:03,880 --> 00:15:11,520
Then there was a change to, no, it's the designer's fault if users are not able to understand

186
00:15:11,520 --> 00:15:12,520
it.

187
00:15:12,520 --> 00:15:14,200
Same with presentation of data.

188
00:15:14,200 --> 00:15:19,960
If you present way too much data thinking I want to display all the work that I've done,

189
00:15:19,960 --> 00:15:27,440
that's not going to come across as well as identifying what's the presentation that would

190
00:15:27,440 --> 00:15:29,960
be the easiest for everyone to understand.

191
00:15:29,960 --> 00:15:31,680
That's important.

192
00:15:31,680 --> 00:15:33,160
Good point.

193
00:15:33,160 --> 00:15:38,600
That's why often I'll just resort to just a bar chart.

194
00:15:38,600 --> 00:15:45,520
I've got a bunch of them for you today because there's a time and a place for different visualizations

195
00:15:45,520 --> 00:15:53,700
and you don't want to just force some fancy chart on a situation because you want to show

196
00:15:53,700 --> 00:15:56,960
off some technique.

197
00:15:56,960 --> 00:16:03,560
It's basically a time and a place for all the various techniques and when in doubt,

198
00:16:03,560 --> 00:16:04,560
go simple.

199
00:16:04,560 --> 00:16:07,280
That's what I do.

200
00:16:07,280 --> 00:16:12,200
But anywho, enough of that spieling.

201
00:16:12,200 --> 00:16:16,960
Let's practice what we preach today and actually do this.

202
00:16:16,960 --> 00:16:22,960
I'll share my screen with you and let's keep it pretty active today because it's a pretty

203
00:16:22,960 --> 00:16:27,560
cool subject and so I don't want to lose your interest at any point.

204
00:16:27,560 --> 00:16:35,720
I'm just going to keep things moving quick because there's a lot of cool ground to cover.

205
00:16:35,720 --> 00:16:39,480
Long story short, got a good laugh out of this.

206
00:16:39,480 --> 00:16:48,000
Saw this on Halloween and wanted to go ahead and share it with you because Yasha has probably

207
00:16:48,000 --> 00:16:55,920
seen more than his fair share of paranormal distributions here in the cannabis industry.

208
00:16:55,920 --> 00:16:57,640
So that's a good laugh.

209
00:16:57,640 --> 00:17:02,120
And then here's a meme that I tried to create.

210
00:17:02,120 --> 00:17:09,760
Hopefully I wanted like a Monty Python style foot like squashing the ghost because today

211
00:17:09,760 --> 00:17:15,560
I discovered a rad distribution.

212
00:17:15,560 --> 00:17:18,880
So just keep in mind the foot.

213
00:17:18,880 --> 00:17:22,440
So hopefully that can be memorable.

214
00:17:22,440 --> 00:17:26,800
Long story short, what's come up a couple times, right?

215
00:17:26,800 --> 00:17:35,560
Then Caleb's mentioned that he's concerned with helping people preserve the diversity of cannabis.

216
00:17:35,560 --> 00:17:43,160
And then I heard a member, Lou, mention that in Connecticut, since the introduction

217
00:17:43,160 --> 00:17:52,000
of adult use, there's been a decline in just the diversity of products on the shelf.

218
00:17:52,000 --> 00:17:58,200
And basically just wanted to see if we could actually put some numbers to that because,

219
00:17:58,200 --> 00:18:05,080
well, if something's meaningful to you and you also can measure it, well, then you can

220
00:18:05,080 --> 00:18:09,040
kind of manage it and maybe do something about it.

221
00:18:09,040 --> 00:18:13,040
So that's the idea for today.

222
00:18:13,040 --> 00:18:17,280
So what metrics do we even have?

223
00:18:17,280 --> 00:18:24,760
Well, just basically ask ChatGPT or start searching around and you'll see that they've

224
00:18:24,760 --> 00:18:30,160
been talking about this in the ecological literature for just a long time.

225
00:18:30,160 --> 00:18:47,200
And basically the easy metrics, you basically hear they're talking about species, but we

226
00:18:47,200 --> 00:18:51,080
can generalize that to anything.

227
00:18:51,080 --> 00:18:55,960
So this could be strains, this could be products.

228
00:18:55,960 --> 00:19:04,520
We were getting fancy and saying this could be licensees.

229
00:19:04,520 --> 00:19:11,360
And actually we'll get super creative later on and we'll actually apply this to chemicals.

230
00:19:11,360 --> 00:19:13,560
So that's when things get wild.

231
00:19:13,560 --> 00:19:23,080
So anywho, the abundance, just the number of each species, then you can almost think

232
00:19:23,080 --> 00:19:26,520
about these as just your typical statistics.

233
00:19:26,520 --> 00:19:38,240
And then evenness, that's going to be kind of how evenly dispersed are the species or

234
00:19:38,240 --> 00:19:41,560
do you have an equal number of each species?

235
00:19:41,560 --> 00:19:46,160
So like Noah's Ark would be perfectly even, right?

236
00:19:46,160 --> 00:19:50,680
He's got two of every species.

237
00:19:50,680 --> 00:19:53,760
So that would be perfect evenness.

238
00:19:53,760 --> 00:20:03,320
And then the other end would be when AI takes over and then you only have like one species.

239
00:20:03,320 --> 00:20:06,360
But just kidding.

240
00:20:06,360 --> 00:20:09,800
That's a future that I don't think is too likely.

241
00:20:09,800 --> 00:20:13,800
But anywho, that would be dominance, right?

242
00:20:13,800 --> 00:20:20,200
So that would be the degree that one species is dominant over the others.

243
00:20:20,200 --> 00:20:23,360
So right now humans are pretty dominant.

244
00:20:23,360 --> 00:20:31,220
And then richness, that's something that's of utmost concern by ecologists.

245
00:20:31,220 --> 00:20:34,360
That's just the number of all the species.

246
00:20:34,360 --> 00:20:42,600
So these two are the ones that you mostly hear people talking about.

247
00:20:42,600 --> 00:20:49,360
So if something is endangered, that's because its abundance is low.

248
00:20:49,360 --> 00:20:54,720
And then people are also talking about the number of species, right?

249
00:20:54,720 --> 00:21:02,680
Like the number of different types of fish in the ocean, so on and so forth.

250
00:21:02,680 --> 00:21:07,240
So let me just rush through this because I don't know, you can find this information

251
00:21:07,240 --> 00:21:08,440
on Wikipedia.

252
00:21:08,440 --> 00:21:16,800
So I don't want to be too repetitive, but it is a nice foundation for us to have.

253
00:21:16,800 --> 00:21:20,320
But basically, let's try to get to math land.

254
00:21:20,320 --> 00:21:25,200
And we can stand on the shoulders of giants.

255
00:21:25,200 --> 00:21:33,000
And this is an equation that people have derived for diversity.

256
00:21:33,000 --> 00:21:38,400
And so it's basically just the sum of all the species, so 1 to r.

257
00:21:38,400 --> 00:21:42,720
So if there's only one species, r is 1.

258
00:21:42,720 --> 00:21:47,240
And then p is the proportion of those species.

259
00:21:47,240 --> 00:21:53,520
So if it's one species, then p is 1.

260
00:21:53,520 --> 00:22:00,880
And then q, I think, is an optional parameter where you can basically weight either the

261
00:22:00,880 --> 00:22:03,640
abundant or the rare species.

262
00:22:03,640 --> 00:22:09,880
If you crank q up, that, I think, puts weight on the abundant species.

263
00:22:09,880 --> 00:22:21,680
And then if you crank q down towards 0, then that puts weight on the rare species.
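
A minimal sketch of the equation being described, assuming the slide shows the standard "true diversity" (Hill number) formula, D = (sum of p_i^q over species 1 to R)^(1/(1-q)). The slide itself is not in the transcript, so the function name `hill_number` and the sample counts are illustrative assumptions.

```python
import math

def hill_number(counts, q):
    """Diversity of order q from a list of per-species counts (assumed formula)."""
    total = sum(counts)
    # Proportion of each species that is actually present.
    props = [c / total for c in counts if c > 0]
    if abs(q - 1.0) < 1e-12:
        # The formula is undefined at q = 1; its limit is exp(Shannon entropy).
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))

# Hypothetical community: one dominant species and two rare ones.
counts = [90, 5, 5]
for q in (0, 1, 2):
    print(q, round(hill_number(counts, q), 3))
```

At q = 0 every species counts equally, so the result is just richness; raising q lets the abundant species dominate the result, which matches the description above.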

264
00:22:21,680 --> 00:22:26,840
So long story short, that's sort of the abstract equation.

265
00:22:26,840 --> 00:22:31,320
And then this is the one that we can actually estimate.

266
00:22:31,320 --> 00:22:39,440
So here, we're just, they call this the Shannon Diversity Index.

267
00:22:39,440 --> 00:22:44,240
And it's basically just the sum of all the species.

268
00:22:44,240 --> 00:22:50,280
And we just multiply the proportion by the log of the proportion.

269
00:22:50,280 --> 00:22:59,160
And so if we just do a count, so we just count all the strains, what proportion of them are

270
00:22:59,160 --> 00:23:01,440
Gorilla Glue?

271
00:23:01,440 --> 00:23:05,800
What proportion of them are Super Silver Haze?

272
00:23:05,800 --> 00:23:09,040
What proportion of them are Runtz?

273
00:23:09,040 --> 00:23:11,360
So on and so forth.

274
00:23:11,360 --> 00:23:16,840
So long story short, that's readily calculable.
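
A minimal sketch of that count-based calculation, assuming one strain name per observation. Note that the usual definition of the Shannon index carries a negative sign, H = -sum(p_i * ln(p_i)), so the result is non-negative. The function name and the sample list are illustrative, not data from the meetup.

```python
import math
from collections import Counter

def shannon_index(labels):
    """Shannon diversity index from a list of category labels (e.g. strain names)."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Sum p * ln(p) over every observed category, then flip the sign.
    return -sum((n / total) * math.log(n / total) for n in counts.values())

# Hypothetical observations, one strain name per lab result.
strains = ["Gorilla Glue", "Gorilla Glue", "Super Silver Haze", "Runtz"]
print(round(shannon_index(strains), 3))
```

A list with a single strain gives an index of 0, and the index grows as more strains appear in more equal proportions.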

275
00:23:16,840 --> 00:23:25,520
And then I may skip over this, but this is actually the statistic that we've seen in

276
00:23:25,520 --> 00:23:26,940
the past.

277
00:23:26,940 --> 00:23:31,080
So I just wanted to give a quick mention to it, the Simpson Index.

278
00:23:31,080 --> 00:23:37,680
And that's just the sum of all of the proportions squared.

279
00:23:37,680 --> 00:23:42,280
And we actually have seen this before.

280
00:23:42,280 --> 00:23:51,320
Ruth, this is pertinent to you, because this measure, Ruth, do you know what this is?

281
00:23:51,320 --> 00:23:53,320
It goes by another name.

282
00:23:53,320 --> 00:23:58,640
Well, it's a measure of industry concentration.

283
00:23:58,640 --> 00:24:00,520
Exactly.

284
00:24:00,520 --> 00:24:05,640
So economists call this the, what's it?

285
00:24:05,640 --> 00:24:09,200
I'm going to get the H's backwards, but it's the...

286
00:24:09,200 --> 00:24:13,160
Hirschman-Herfindahl Index.

287
00:24:13,160 --> 00:24:16,200
Exactly, the Hirschman-Herfindahl Index.

288
00:24:16,200 --> 00:24:19,480
And it's basically the exact same statistic.

289
00:24:19,480 --> 00:24:25,520
And basically this is where, you know, in science, there's like a rule that the metric

290
00:24:25,520 --> 00:24:29,100
is never named after the person who first found it.

291
00:24:29,100 --> 00:24:37,600
So I think, you know, somebody did the square root of this, I think back in 1945.

292
00:24:37,600 --> 00:24:42,000
And then Simpson introduced this in 1949.

293
00:24:42,000 --> 00:24:48,200
And then I think Herfindahl came in at 1950 and maybe popularized it.

294
00:24:48,200 --> 00:24:59,000
And basically, this is a go-to statistic in economics because it's readily calculable.

295
00:24:59,000 --> 00:25:02,920
And you can basically determine the market share.

296
00:25:02,920 --> 00:25:12,760
It's a, well, once you have the market share of all the firms, you can figure out how concentrated

297
00:25:12,760 --> 00:25:14,280
the markets are.
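
A minimal sketch of the sum-of-squared-proportions statistic, which serves as the Simpson index when the inputs are species counts and as the Herfindahl-Hirschman Index when they are firm revenues. The firm names and revenue figures below are made up for illustration.

```python
def simpson_index(values):
    """Sum of squared proportions: 1/N for N equal shares, 1.0 for a monopoly."""
    total = sum(values)
    return sum((v / total) ** 2 for v in values)

# Hypothetical revenues for three firms in one market.
revenues = {"Firm A": 50.0, "Firm B": 30.0, "Firm C": 20.0}
hhi = simpson_index(revenues.values())
print(round(hhi, 2))
```

Higher values mean a more concentrated market (or a less diverse community), so the same function can be pointed at strain counts or at licensee market shares.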

298
00:25:14,280 --> 00:25:18,040
And so this is a way that you could actually...

299
00:25:18,040 --> 00:25:23,840
And actually, we've done this to a certain extent, but we only estimated it.

300
00:25:23,840 --> 00:25:34,720
So we estimated the concentration, not of species, but of licenses, of market share in the various

301
00:25:34,720 --> 00:25:35,720
states.

302
00:25:35,720 --> 00:25:37,040
I mean, how do we do that?

303
00:25:37,040 --> 00:25:42,720
We did that by total revenue divided by number of licenses.

304
00:25:42,720 --> 00:25:49,000
Actually, I'm gonna actually need a refresher on that.

305
00:25:49,000 --> 00:25:51,800
So I'll make sure that that old recording is uploaded.

306
00:25:51,800 --> 00:25:54,800
That may have been in a Saturday morning statistic.

307
00:25:54,800 --> 00:26:03,200
And long story short, Ruth, I wouldn't be surprised if you end up calculating some of

308
00:26:03,200 --> 00:26:04,920
these statistics yourself.

309
00:26:04,920 --> 00:26:15,080
So anywho, just wanted to make that quick connection because it's funny how sometimes

310
00:26:15,080 --> 00:26:21,600
different fields will be in their own silo.

311
00:26:21,600 --> 00:26:31,120
Economics is in its own silo, ecology's in its own silo, but they're still using statistics

312
00:26:31,120 --> 00:26:33,120
at the end of the day.

313
00:26:33,120 --> 00:26:39,720
And one statistic that's useful in one field is often useful in another.

314
00:26:39,720 --> 00:26:45,880
So I'll go ahead and move on because maybe I don't want to bore you too much with that.

315
00:26:45,880 --> 00:26:53,720
But let's go ahead and get to the empirics, get out of theory land and get to the empirics

316
00:26:53,720 --> 00:26:58,160
because I think this is where things get fun.

317
00:26:58,160 --> 00:27:08,080
So here is a figure that's readily calculable with the, and that's the last time I'll say

318
00:27:08,080 --> 00:27:11,360
that word, with the data that we have.

319
00:27:11,360 --> 00:27:23,280
So as always, I basically like to show you a interesting chart or figure from the literature

320
00:27:23,280 --> 00:27:26,480
that we can try to reproduce.

321
00:27:26,480 --> 00:27:40,320
And this is the go-to chart in ecology that really helps us visualize all of these dimensions.

322
00:27:40,320 --> 00:27:46,720
First off, you're plotting abundance.

323
00:27:46,720 --> 00:27:52,000
So just subtract the relative for right now.

324
00:27:52,000 --> 00:27:57,760
So you could just plot abundance and we're going to do that momentarily.

325
00:27:57,760 --> 00:27:59,380
So that's plotted.

326
00:27:59,380 --> 00:28:06,880
Or you could plot relative abundance, which is basically how abundant is everything to

327
00:28:06,880 --> 00:28:13,120
the most abundant species or type.

328
00:28:13,120 --> 00:28:23,680
And so that basically kind of encapsulates dominance, basically how, to what degree is

329
00:28:23,680 --> 00:28:28,040
this first one dominant, so to speak.

330
00:28:28,040 --> 00:28:37,240
And it also helps you visualize richness, which is just the number of species, so how

331
00:28:37,240 --> 00:28:40,560
long is this tail.

332
00:28:40,560 --> 00:28:48,800
So the longer the tail, the richer the community.

333
00:28:48,800 --> 00:28:52,880
So this is a community of species.

334
00:28:52,880 --> 00:28:57,680
And later on, what we can do is actually compare communities.

335
00:28:57,680 --> 00:29:00,600
And so that's basically on the agenda for next week.

336
00:29:00,600 --> 00:29:05,360
That's when we kind of get into Ruth's land, where we kind of think of all the different

337
00:29:05,360 --> 00:29:13,320
states as communities or ecosystems.

338
00:29:13,320 --> 00:29:14,320
That's cool.

339
00:29:14,320 --> 00:29:15,440
That's coming up.

340
00:29:15,440 --> 00:29:17,600
So we've got richness.

341
00:29:17,600 --> 00:29:25,560
And then evenness is basically how steep or flat is this curve.

342
00:29:25,560 --> 00:29:33,040
If we're on Noah's Ark, then this curve is perfectly flat.

343
00:29:33,040 --> 00:29:36,960
Everything's evenly represented.

344
00:29:36,960 --> 00:29:46,360
Conversely, if it's skewed, there's not much evenness.
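
A minimal sketch of the quantities just described, assuming a plain list of observed labels. The animal names and counts are made up for illustration, and Pielou's evenness (Shannon entropy divided by its maximum) is only one common way to put a number on "how flat is the curve"; the talk itself does not name a specific evenness measure.

```python
import math
from collections import Counter

def rank_abundance(observations):
    """Return (label, count) pairs ordered from most to least abundant."""
    return Counter(observations).most_common()

def relative_to_top(ranked):
    """Scale every count by the most abundant type, so the first value is 1.0."""
    top = ranked[0][1]
    return [(label, count / top) for label, count in ranked]

def richness(ranked):
    """Richness is simply the number of distinct types (the length of the tail)."""
    return len(ranked)

def pielou_evenness(ranked):
    """Shannon entropy divided by log(richness): 1.0 for a perfectly flat curve."""
    total = sum(count for _, count in ranked)
    if len(ranked) < 2:
        return 0.0  # undefined for a single type; this sketch returns 0.0
    entropy = sum(-(c / total) * math.log(c / total) for _, c in ranked)
    return entropy / math.log(len(ranked))

# Hypothetical communities: a flat "ark" and a skewed one.
ark = ["lion"] * 2 + ["dove"] * 2 + ["ox"] * 2
skewed = ["lion"] * 10 + ["dove"] + ["ox"]

print(rank_abundance(skewed))
print(pielou_evenness(rank_abundance(ark)), pielou_evenness(rank_abundance(skewed)))
```

Plotting the counts from `rank_abundance` against rank, with a log-scaled y-axis, gives the kind of curve shown on the slide.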

345
00:29:46,360 --> 00:29:49,920
OK, so you get the idea.

346
00:29:49,920 --> 00:29:55,720
So hopefully, I haven't completely lost you yet, because let's go ahead and actually get

347
00:29:55,720 --> 00:30:03,640
to data land and actually see if we can calculate these statistics.

348
00:30:03,640 --> 00:30:11,080
But while I'm getting everything ready, are there any thoughts, comments, questions?

349
00:30:11,080 --> 00:30:12,080
A quick one.

350
00:30:12,080 --> 00:30:18,880
I believe on Noah's Ark most animals were in pairs, but clean animals, or those that

351
00:30:18,880 --> 00:30:23,640
are kosher, were in sevens.

352
00:30:23,640 --> 00:30:27,920
Sorry, I think that's true.

353
00:30:27,920 --> 00:30:29,520
That's such a fun fact.

354
00:30:29,520 --> 00:30:32,040
So that's wild.

355
00:30:32,040 --> 00:30:42,000
So it will definitely change the abundance curve.

356
00:30:42,000 --> 00:30:45,040
I think that's true, from memory.

357
00:30:45,040 --> 00:30:52,920
That'd be a fun statistics lesson to actually kind of plot out the, oh, what if Noah's Ark

358
00:30:52,920 --> 00:30:53,920
had two animals?

359
00:30:53,920 --> 00:30:58,680
Oh, what if there were actually seven of a certain kind?

360
00:30:58,680 --> 00:31:00,960
And a second quick point.

361
00:31:00,960 --> 00:31:07,880
I assume that the graph that you were just showing, if it was not in a logarithmic scale

362
00:31:07,880 --> 00:31:13,160
on the y-axis, it would have been closer to Benford's law?

363
00:31:13,160 --> 00:31:22,240
Oh, you know, I don't remember that one off the top of my head, but we did look at Benford's

364
00:31:22,240 --> 00:31:23,240
law.

365
00:31:23,240 --> 00:31:25,440
So does that follow this similar distribution?

366
00:31:25,440 --> 00:31:33,280
I can't tell because it's logarithmic here, but there should be a, I think so.

367
00:31:33,280 --> 00:31:35,640
I don't want to pause what you're doing.

368
00:31:35,640 --> 00:31:42,800
Well, that's a good note though, because they actually do recommend, well, not recommend,

369
00:31:42,800 --> 00:31:50,320
it's basically, I don't know what the right word is, customary in the literature or in

370
00:31:50,320 --> 00:31:54,000
the field to represent things in log scale.

371
00:31:54,000 --> 00:31:57,960
So we'll actually be doing that today.

372
00:31:57,960 --> 00:32:02,520
So we'll double check if it resembles Benford's law.

373
00:32:02,520 --> 00:32:05,880
So I think that's the occurrence of digits.

374
00:32:05,880 --> 00:32:07,240
Exactly.

375
00:32:07,240 --> 00:32:16,800
Not just digits, but anything that comes up, the most common words in a book or how often

376
00:32:16,800 --> 00:32:21,840
a word comes up in the book, the most common is going to be X percent more often than the

377
00:32:21,840 --> 00:32:24,480
next, the second and so on.

378
00:32:24,480 --> 00:32:26,640
Letters apply the same way.

379
00:32:26,640 --> 00:32:28,640
It's not just digits.

380
00:32:28,640 --> 00:32:30,640
I think you're a hundred percent right, Yasha.

381
00:32:30,640 --> 00:32:34,200
Now you're kind of jogging my memory.

382
00:32:34,200 --> 00:32:41,560
I think this person Simpson, I think he was originally studying characters.

383
00:32:41,560 --> 00:32:48,240
So like he was, I think studying like, what's the probability, you know, the next character

384
00:32:48,240 --> 00:32:55,040
is going to be a Z or an A so on and so forth.

385
00:32:55,040 --> 00:33:01,440
And I don't know, it just, it seems like it's just such a small world because that's kind

386
00:33:01,440 --> 00:33:07,280
of like almost exactly the talk of the town right now.

387
00:33:07,280 --> 00:33:11,240
Right now everybody's like, right.

388
00:33:11,240 --> 00:33:16,400
I've already mentioned ChatGPT a couple of times, which is basically just a really,

389
00:33:16,400 --> 00:33:19,480
really sophisticated language model.

390
00:33:19,480 --> 00:33:22,360
And this is something that, you know, Simpson, right.

391
00:33:22,360 --> 00:33:28,640
So ChatGPT is predicting maybe word after word, but you know, Simpson, you

392
00:33:28,640 --> 00:33:34,280
know, started and he's doing it character after character.

393
00:33:34,280 --> 00:33:43,840
It's just, it's just wild how, you know, one giant stood on the shoulders of another giant

394
00:33:43,840 --> 00:33:46,440
who stood on the shoulders of another giant.

395
00:33:46,440 --> 00:33:56,600
And, you know, in the matter of 60 to 80 years, you know, statistics is, I mean, now it's

396
00:33:56,600 --> 00:34:01,240
really just blowing people's minds.

397
00:34:01,240 --> 00:34:09,960
So anywho, these statistics can be used in any field, right.

398
00:34:09,960 --> 00:34:12,560
It's like they can be used in natural language processing.

399
00:34:12,560 --> 00:34:16,360
They can be used in ecology and economics.

400
00:34:16,360 --> 00:34:21,480
And, as we'll show you, even kind of in chemistry.

401
00:34:21,480 --> 00:34:28,560
But let's go ahead and start plotting these because enough talk, let's do.

402
00:34:28,560 --> 00:34:35,920
So what if basically we have the Washington data, I should maybe have told you a little

403
00:34:35,920 --> 00:34:41,760
about it before we just jumped in here and started plotting, but you've seen it enough.

404
00:34:41,760 --> 00:34:46,000
We have about 70,000 lab results for Washington.

405
00:34:46,000 --> 00:34:49,280
We know the strain names.

406
00:34:49,280 --> 00:34:57,200
Of course, the strain names could be cleaned up a good bit because they were entered by

407
00:34:57,200 --> 00:34:59,420
humans.

408
00:34:59,420 --> 00:35:02,800
So we could probably use some natural language processing.

409
00:35:02,800 --> 00:35:07,760
I'll put that on kind of the side burner.

410
00:35:07,760 --> 00:35:14,840
I'll post the code to GitHub where I've started to do some natural language processing.

411
00:35:14,840 --> 00:35:22,040
It kind of detracts from the main purpose today, which is to start looking at these

412
00:35:22,040 --> 00:35:24,680
abundance curves.

413
00:35:24,680 --> 00:35:33,880
And so keep in mind, we're roughly trying to reproduce a chart that looks like this.

414
00:35:33,880 --> 00:35:40,320
And this is kind of where I got the idea of the foot from earlier.

415
00:35:40,320 --> 00:35:49,840
So maybe if you need an easy way to remember this kind of distribution, the relative abundance

416
00:35:49,840 --> 00:35:55,960
distribution, the RAD distribution, I think of a foot.

417
00:35:55,960 --> 00:36:02,720
But anywho, here I just needed something to talk about while this was plotting.

418
00:36:02,720 --> 00:36:12,280
But here's basically the relative, or actually this isn't relative, this is just the abundance

419
00:36:12,280 --> 00:36:17,920
curve of strains in Washington.

420
00:36:17,920 --> 00:36:24,240
And we can actually find the richness.

421
00:36:24,240 --> 00:36:38,360
So the richness: there are 9,000 strains that have been grown in Washington between 2021

422
00:36:38,360 --> 00:36:45,200
and the end of September of 2023.

423
00:36:45,200 --> 00:36:51,080
So you can play with this time scale for your own interests.

424
00:36:51,080 --> 00:36:58,240
But right now, I kind of weighed the pros and cons of excluding data and decided just

425
00:36:58,240 --> 00:37:04,200
to include everything.

426
00:37:04,200 --> 00:37:06,040
I don't know.

427
00:37:06,040 --> 00:37:09,360
Maybe we should look at it year by year.

428
00:37:09,360 --> 00:37:11,680
So I think I'll do that down below.

429
00:37:11,680 --> 00:37:15,860
But anywho, this is something that everybody wants to know.

430
00:37:15,860 --> 00:37:19,800
So we'll go ahead and find these statistics.

431
00:37:19,800 --> 00:37:23,240
What are the top strains?

432
00:37:23,240 --> 00:37:27,680
And we always mention this time and time again.

433
00:37:27,680 --> 00:37:35,320
And keep in mind, I haven't applied any natural language processing.

434
00:37:35,320 --> 00:37:43,200
We have shown in the past that if you start combining Gorilla Glue and Gorilla Glue number

435
00:37:43,200 --> 00:37:51,240
4 and GG4, that actually does push Gorilla Glue way up the chart.

436
00:37:51,240 --> 00:37:54,040
So I haven't done that here.

437
00:37:54,040 --> 00:37:57,200
So just keep that in mind.
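
A minimal sketch of the kind of name merging mentioned here, assuming a small hand-written alias table. The `ALIASES` entries and the sample names are hypothetical; the real data would need a much larger table or fuzzy matching, and this is not the natural language processing code referred to in the talk.

```python
import re
from collections import Counter

# Hypothetical alias table: cleaned spelling -> canonical strain name.
ALIASES = {
    "gorilla glue 4": "gorilla glue",
    "gg4": "gorilla glue",
    "gg 4": "gorilla glue",
}

def normalize(name):
    """Lowercase, drop '#' and the word 'number', collapse spaces, apply aliases."""
    key = name.lower().replace("#", " ")
    key = re.sub(r"\bnumber\b", " ", key)
    key = re.sub(r"\s+", " ", key).strip()
    return ALIASES.get(key, key)

# Made-up sample of human-entered strain names.
names = ["Gorilla Glue", "Gorilla Glue #4", "GG4", "Blue Dream", "Blue Dream"]

raw = Counter(names)
merged = Counter(normalize(n) for n in names)

print(raw.most_common(1))     # most common name before merging
print(merged.most_common(1))  # most common name after merging
```

In this toy sample the merged spellings overtake the strain that led the raw counts, which is the effect described above.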

438
00:37:57,200 --> 00:38:02,560
Once again, I'll show you some really, really cool abundance charts here in a second.

439
00:38:02,560 --> 00:38:08,760
But I just wanted to go ahead and introduce you to the concept with just some strains,

440
00:38:08,760 --> 00:38:09,760
right?

441
00:38:09,760 --> 00:38:17,000
Here is just the top 20 in Washington.

442
00:38:17,000 --> 00:38:22,800
And then we don't really need to do this.

443
00:38:22,800 --> 00:38:30,480
But just to repeat, this is just the same chart but with relative abundance.

444
00:38:30,480 --> 00:38:37,840
And here, I think it's relative to the total abundance instead of relative to the first

445
00:38:37,840 --> 00:38:40,160
one.

446
00:38:40,160 --> 00:38:50,240
But this just, I wanted to plot it because it's just a slightly different shaped curve.

447
00:38:50,240 --> 00:38:58,480
So that way, you've seen the shape. Actually, maybe it's the same shape of curve.

448
00:38:58,480 --> 00:39:04,680
OK, so relative abundance, there you have it.

449
00:39:04,680 --> 00:39:09,360
So just wanted to do it just to say we did it, right?

450
00:39:09,360 --> 00:39:11,460
That was what we set out to do.

451
00:39:11,460 --> 00:39:16,320
Relative abundance curve, relative abundance curve.

452
00:39:16,320 --> 00:39:17,520
Cool.

453
00:39:17,520 --> 00:39:20,200
We checked that checkbox.

454
00:39:20,200 --> 00:39:28,680
But now I actually want to kind of show you where you can take this.

455
00:39:28,680 --> 00:39:37,120
Because it's kind of cool to know, oh, what are the top 20 strains in Washington?

456
00:39:37,120 --> 00:39:42,400
But yeah, what's in a strain name?

457
00:39:42,400 --> 00:39:47,360
So I kind of want to move to the chemical side of things.

458
00:39:47,360 --> 00:39:50,840
Because why do I want to do that?

459
00:39:50,840 --> 00:39:56,480
Well, real quick, I'll show you one last one of these strain abundance charts that kind

460
00:39:56,480 --> 00:40:03,320
of will let me show you why we want to start thinking about chemicals.

461
00:40:03,320 --> 00:40:09,840
Because what if we look at abundance year by year?

462
00:40:09,840 --> 00:40:16,480
And hopefully, one of you can think of a better way to visualize this because I'm not happy

463
00:40:16,480 --> 00:40:19,040
with this visualization.

464
00:40:19,040 --> 00:40:27,240
This is a case where you actually may want a more sophisticated chart than a bar chart.

465
00:40:27,240 --> 00:40:36,360
So here you see the bar chart does a brilliant job of visualizing the relative abundance.

466
00:40:36,360 --> 00:40:39,920
Ooh, thought, comment, question?

467
00:40:39,920 --> 00:40:46,020
If you were to stack them year by year, so instead of having overlays, stack them.

468
00:40:46,020 --> 00:40:50,960
Then you can see the relative sizes, and then you can also do one based on absolute numbers

469
00:40:50,960 --> 00:40:54,120
and based on proportions.
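
A minimal sketch of the stacking suggested here, assuming the data is a list of `(year, strain)` records. The sample records are made up, and `plot_stacked` relies on the third-party matplotlib package, imported lazily so the counting functions run without it. This is not the code ChatGPT produced during the session.

```python
from collections import Counter, defaultdict

def counts_by_year(records):
    """Build {year: Counter({strain: count})} from (year, strain) records."""
    table = defaultdict(Counter)
    for year, strain in records:
        table[year][strain] += 1
    return table

def proportions_by_year(table):
    """Convert each year's counts to shares of that year's total."""
    result = {}
    for year, counts in table.items():
        total = sum(counts.values())
        result[year] = {strain: n / total for strain, n in counts.items()}
    return result

def plot_stacked(table, title):
    """Draw one bar per strain, with one stacked segment per year."""
    import matplotlib.pyplot as plt  # third-party, only needed for plotting
    strains = sorted({s for values in table.values() for s in values})
    fig, ax = plt.subplots()
    bottoms = [0.0] * len(strains)
    for year in sorted(table):
        heights = [table[year].get(s, 0) for s in strains]
        ax.bar(strains, heights, bottom=bottoms, label=str(year))
        bottoms = [b + h for b, h in zip(bottoms, heights)]
    ax.set_title(title)
    ax.legend()
    return fig

# Hypothetical records for illustration.
records = [(2021, "a"), (2021, "a"), (2021, "b"), (2022, "a"), (2022, "c")]
absolute = counts_by_year(records)
shares = proportions_by_year(absolute)
```

Calling `plot_stacked(absolute, "Counts")` and `plot_stacked(shares, "Proportions")` would give the two views mentioned: one on absolute numbers and one on proportions.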

470
00:40:54,120 --> 00:40:56,120
OK.

471
00:40:56,120 --> 00:41:05,400
I'm not certain I can pull that off right this second.

472
00:41:05,400 --> 00:41:10,640
But ChatGPT may.

473
00:41:10,640 --> 00:41:17,240
So real quick, just for fun, let's see if old ChatGPT can pull this off, and then we can maybe

474
00:41:17,240 --> 00:41:19,640
look at chemicals.

475
00:41:19,640 --> 00:41:22,680
Because I think you're absolutely right.

476
00:41:22,680 --> 00:41:32,000
Can you turn this into a stacked bar chart, please?

477
00:41:32,000 --> 00:41:34,240
So we'll just see if this works.

478
00:41:34,240 --> 00:41:35,600
If it works, it works.

479
00:41:35,600 --> 00:41:38,120
And if not, not.

480
00:41:38,120 --> 00:41:47,040
So basically, while ChatGPT is drafting that up, the way that I was interpreting my subpar

481
00:41:47,040 --> 00:41:58,040
visualization is that you see some, this is basically the top 20 strains for the three

482
00:41:58,040 --> 00:42:00,720
years.

483
00:42:00,720 --> 00:42:04,360
And you see some strains drop off.

484
00:42:04,360 --> 00:42:15,800
So these strains were top strains in 2021, but not in 2022 and 2023.

485
00:42:15,800 --> 00:42:18,100
Same with the red ones.

486
00:42:18,100 --> 00:42:25,080
The ones that are only red were top strains in 2022, but not in the other years.

487
00:42:25,080 --> 00:42:31,380
And then these ones, there's all three colors, they're top strains each year.

488
00:42:31,380 --> 00:42:37,280
And then you can kind of see if they've increased or decreased from year to year.

489
00:42:37,280 --> 00:42:44,600
So if the yellow is higher, that means they've decreased in popularity from 2021.

490
00:42:44,600 --> 00:42:49,560
And then if the blue is higher, it means, like, Runtz.

491
00:42:49,560 --> 00:42:52,920
Runtz has increased in popularity.

492
00:42:52,920 --> 00:43:02,120
So let's see if this stacked bar chart just works.

493
00:43:02,120 --> 00:43:07,960
And this is sort of, let's just kind of copy and paste this and just see what happens,

494
00:43:07,960 --> 00:43:09,480
because we only have 15 minutes.

495
00:43:09,480 --> 00:43:16,000
So let's just see how close ChatGPT gets.

496
00:43:16,000 --> 00:43:33,560
Looks like, let me just see what this error is.

497
00:43:33,560 --> 00:43:36,920
Okay.

498
00:43:36,920 --> 00:43:42,560
So once again, we'll just let ChatGPT think in the background and we'll come back at the

499
00:43:42,560 --> 00:43:49,720
end and see if they're able to, and this is where large language models have come.

500
00:43:49,720 --> 00:43:53,480
And this is like the kind of the cool thing about it, right?

501
00:43:53,480 --> 00:44:02,080
I doubt Simpson would have ever, ever imagined that, you know, he's trying his

502
00:44:02,080 --> 00:44:09,440
hardest to predict the next letter in a sentence.

503
00:44:09,440 --> 00:44:22,800
And now we're basically having ChatGPT try to predict entire blocks of code.

504
00:44:22,800 --> 00:44:32,000
So anywho, and you know, as you can see, I think the model is doing it letter by letter,

505
00:44:32,000 --> 00:44:35,360
or at least word by word.

506
00:44:35,360 --> 00:44:45,840
But anywho, let's go ahead and start looking at some of this chemical data and I'll try

507
00:44:45,840 --> 00:44:58,160
this stacked chart one last time.

508
00:44:58,160 --> 00:45:06,480
Let's just see if this works.

509
00:45:06,480 --> 00:45:11,680
So it may just work.

510
00:45:11,680 --> 00:45:15,160
Okay.

511
00:45:15,160 --> 00:45:22,880
So I'm going to, just so I don't derail the meetup, maybe I'll let you all play around

512
00:45:22,880 --> 00:45:29,000
with this or I'll take a stab at it after the meetup or Ruth, you're welcome to.

513
00:45:29,000 --> 00:45:35,840
But basically that's, I think, as Ruth mentioned, I think that's a better way to visualize this

514
00:45:35,840 --> 00:45:41,360
is a stacked bar chart.

515
00:45:41,360 --> 00:45:52,600
But anywho, any thought, comment, question? And then I'll move on to the chemistry side.

516
00:45:52,600 --> 00:45:56,120
Sorry, Day.

517
00:45:56,120 --> 00:45:58,840
Okay.

518
00:45:58,840 --> 00:46:06,360
Well let's change gears a bit.

519
00:46:06,360 --> 00:46:14,040
Programming is something where you really have to just kind of sit down and, you know, you have to

520
00:46:14,040 --> 00:46:18,400
like read it and do it gracefully.

521
00:46:18,400 --> 00:46:22,680
Sometimes I'll try to take a stab at it like I just did there and try to do it on the spot,

522
00:46:22,680 --> 00:46:25,040
but it's difficult.

523
00:46:25,040 --> 00:46:31,760
So long story short, there's a lot of trial and error and I don't want to bore you with

524
00:46:31,760 --> 00:46:33,560
all of my errors.

525
00:46:33,560 --> 00:46:38,640
So let's change gears here.

526
00:46:38,640 --> 00:46:44,720
But long story short, the important thing, the important chart over here is the abundance

527
00:46:44,720 --> 00:46:47,320
of strains.

528
00:46:47,320 --> 00:46:55,520
And essentially for next week, I don't know if we're going to be able to do it because

529
00:46:55,520 --> 00:46:59,200
we'll actually have to use natural language processing, I think.

530
00:46:59,200 --> 00:47:09,200
But you would like to basically compare relative abundance curves from one state to the other.

531
00:47:09,200 --> 00:47:14,960
Which state has a richer variety of strains?

532
00:47:14,960 --> 00:47:21,920
You know, California, Oregon, Washington, Colorado, Massachusetts.

533
00:47:21,920 --> 00:47:25,720
You know, it may go by population, but you may be surprised.

534
00:47:25,720 --> 00:47:35,040
You know, Michigan may just have a really rich community of strains.

535
00:47:35,040 --> 00:47:43,160
So that's not good.

536
00:47:43,160 --> 00:47:50,080
Okay, so I'm going to go ahead and change gears here and start looking at the Connecticut

537
00:47:50,080 --> 00:47:53,680
data.

538
00:47:53,680 --> 00:48:04,120
And I was able to actually correct a mistake I made in a past week and actually get the

539
00:48:04,120 --> 00:48:06,220
chemicals sorted out.

540
00:48:06,220 --> 00:48:16,280
So I hadn't correctly matched up the lab results yet, but I think I have everything

541
00:48:16,280 --> 00:48:17,400
aligned.

542
00:48:17,400 --> 00:48:25,800
So for example, if you look at one of these lab results, here's just this first one.

543
00:48:25,800 --> 00:48:30,660
You see 1.1% THC.

544
00:48:30,660 --> 00:48:52,760
If you go look at the lab result URL, then we get this COA and sure enough, you have

545
00:48:52,760 --> 00:48:56,120
1.1% THC.

546
00:48:56,120 --> 00:49:05,480
What's really kind of bizarre though is I think we're going to need to do some double

547
00:49:05,480 --> 00:49:06,480
checking here.

548
00:49:06,480 --> 00:49:14,160
Because like, so for example, like I don't know what to make of this, right?

549
00:49:14,160 --> 00:49:20,120
Because it says, you know, one to one, it's a one to one drop.

550
00:49:20,120 --> 00:49:40,680
But you know, here they have 1.1, 1.1, but then the COA has, you know, CBD 1.15.

551
00:49:40,680 --> 00:49:41,680
Thought, comment, question?

552
00:49:41,680 --> 00:49:51,200
Scrolling down just a little bit, no, sorry, on the COA, because within the code, it also

553
00:49:51,200 --> 00:49:55,720
says the camphor is at 1.1.

554
00:49:55,720 --> 00:50:01,160
And I'm not seeing where that's pulled from.

555
00:50:01,160 --> 00:50:10,400
This is something that I've identified as an error.

556
00:50:10,400 --> 00:50:12,680
For these three compounds.

557
00:50:12,680 --> 00:50:17,680
So let's do another random sample here.

558
00:50:17,680 --> 00:50:20,600
Right.

559
00:50:20,600 --> 00:50:28,480
Because basically the data integrity of these is something that I'm really, really not sure

560
00:50:28,480 --> 00:50:29,480
about.

561
00:50:29,480 --> 00:50:32,480
There's something else here.

562
00:50:32,480 --> 00:50:35,960
And that's that when you have a brand that advertises it's putting out a product that

563
00:50:35,960 --> 00:50:41,240
they say, say one to one, then that's kind of what they're hoping for.

564
00:50:41,240 --> 00:50:45,120
But there's going to be some small variation from plant to plant.

565
00:50:45,120 --> 00:50:51,040
So generally, it's about one to one, but not necessarily exactly one to one.

566
00:50:51,040 --> 00:50:53,280
Okay, here.

567
00:50:53,280 --> 00:50:54,720
Exactly.

568
00:50:54,720 --> 00:50:57,240
So that's why that may not have been the best sample.

569
00:50:57,240 --> 00:50:58,680
So here, let's look at this one.

570
00:50:58,680 --> 00:51:06,600
So this one is 73.56, and we do have a spot check there.

571
00:51:06,600 --> 00:51:12,040
But it also says that the CBD is also 73.56 and camphor is also.

572
00:51:12,040 --> 00:51:21,280
Okay, so I wonder if it's not impossible that this was a coding error on my...

573
00:51:21,280 --> 00:51:32,680
So basically, I haven't yet deduced if this is an error from me collecting the data or

574
00:51:32,680 --> 00:51:35,200
from their actual API.

575
00:51:35,200 --> 00:51:42,960
But here, I'll show you where I'm getting the data and then maybe you can help solve

576
00:51:42,960 --> 00:51:43,960
this problem.

577
00:51:43,960 --> 00:51:47,440
But any other thought, comment, question?

578
00:51:47,440 --> 00:51:58,840
So long story short, this is kind of why I'm calling on you for investigation here.

579
00:51:58,840 --> 00:52:03,160
But I'm almost 100% certain that...

580
00:52:03,160 --> 00:52:08,640
So here's their database.

581
00:52:08,640 --> 00:52:21,680
And so you see in their database, for whatever reason, they're repeating these values.
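
A minimal sketch of an automated check for this kind of repetition, assuming each lab result is a dictionary keyed by analyte. The field names and the `thca` value are hypothetical; only the repeated 73.56 comes from the example above. A flag from this check is only a prompt to compare against the COA, since two analytes can legitimately share a value.

```python
def repeated_value_fields(row, fields, min_repeats=2):
    """Group analyte fields that share the same non-zero value in one record."""
    by_value = {}
    for field in fields:
        value = row.get(field)
        if value is None or value == 0:
            continue  # missing or zero values repeat naturally, so skip them
        by_value.setdefault(value, []).append(field)
    return {v: names for v, names in by_value.items() if len(names) >= min_repeats}

# Hypothetical record mirroring the spot check above.
row = {"thc": 73.56, "cbd": 73.56, "camphor": 73.56, "thca": 0.5}
suspect = repeated_value_fields(row, ["thc", "cbd", "camphor", "thca"])
print(suspect)
```

Running the check across all records and counting how often each field is flagged would show which compounds are affected most.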

582
00:52:21,680 --> 00:52:31,160
So I guess just because they did it doesn't necessarily mean that we should ignore the

583
00:52:31,160 --> 00:52:32,160
fault.

584
00:52:32,160 --> 00:52:42,480
So long story short is, this is actually why we're gonna have to do it the hard way.

585
00:52:42,480 --> 00:52:47,880
It's like life just can't...

586
00:52:47,880 --> 00:52:51,360
Nothing's ever easy in this world.

587
00:52:51,360 --> 00:52:59,440
So it's basically like, it would be so nice just to read the data from their API.

588
00:52:59,440 --> 00:53:05,320
But we actually, I think, are going to have to go to the source of truth, which is the

589
00:53:05,320 --> 00:53:06,920
COA.

590
00:53:06,920 --> 00:53:15,440
So basically, at the end of the day, this certificate that has these signatures is the

591
00:53:15,440 --> 00:53:18,440
source of truth.

592
00:53:18,440 --> 00:53:20,800
This is the official document.

593
00:53:20,800 --> 00:53:29,520
So we actually do have the technology to parse these COAs.

594
00:53:29,520 --> 00:53:37,280
I was actually going to prepare that for you for today, but it was just getting too much.

595
00:53:37,280 --> 00:53:41,760
So I think that may be on the agenda for next week: maybe do some COA parsing, maybe do

596
00:53:41,760 --> 00:53:43,840
some similarity.

597
00:53:43,840 --> 00:53:45,240
We'll do some things like that.

598
00:53:45,240 --> 00:53:48,620
But it was getting a little much for today.

599
00:53:48,620 --> 00:53:58,600
So I may have to exclude CBD too, but I think we're just going to have to proceed for today

600
00:53:58,600 --> 00:54:08,160
just knowing that some of the compounds like THC appear to be being read correctly.

601
00:54:08,160 --> 00:54:15,040
But I think I'm gonna actually have to exclude CBD now too, because it looks like CBD is

602
00:54:15,040 --> 00:54:18,520
incorrect.

603
00:54:18,520 --> 00:54:23,200
So long story short, we're gonna have to get the data out of these COAs.

604
00:54:23,200 --> 00:54:27,320
So that's on the docket for next week.

605
00:54:27,320 --> 00:54:30,440
So let me actually do that real quick.

606
00:54:30,440 --> 00:54:32,520
Where is this bad one?

607
00:54:32,520 --> 00:54:34,040
CBD.

608
00:54:34,040 --> 00:54:40,240
So here I've actually created a list of all the cannabinoids and terpenes, and it looks

609
00:54:40,240 --> 00:54:44,880
like I'm actually gonna have to cut out CBD too.

610
00:54:44,880 --> 00:54:52,600
So this is the first time that I've seen the statistics without this compound.

611
00:54:52,600 --> 00:54:56,000
But good eyes, Yasha.

612
00:54:56,000 --> 00:55:01,880
And this is why it's so important to double check the data.

613
00:55:01,880 --> 00:55:12,400
I'm not even necessarily faulting anybody in Connecticut, because look, you're dealing

614
00:55:12,400 --> 00:55:21,760
with, including these three, 41 terpenes.

615
00:55:21,760 --> 00:55:25,920
Whoops, that was misspelled.

616
00:55:25,920 --> 00:55:26,920
That's a function.

617
00:55:26,920 --> 00:55:34,840
So you're dealing with, and I want to say there's 18 cannabinoids, 10 cannabinoids.

618
00:55:34,840 --> 00:55:36,560
So you're dealing with a lot of data.

619
00:55:36,560 --> 00:55:44,200
Okay, so let's exclude those and power through this just to go ahead and do the best we can.

620
00:55:44,200 --> 00:55:50,400
So we're gonna do the best we can with the data that we have, and just know that this

621
00:55:50,400 --> 00:56:01,160
data is super, super suspect, because we've already identified four compounds that appear

622
00:56:01,160 --> 00:56:08,760
to be incorrectly entered.

623
00:56:08,760 --> 00:56:21,480
So long story short, if we think about chemicals in the same terms as we were thinking about

624
00:56:21,480 --> 00:56:27,200
strain abundance, well, this is chemical abundance.

625
00:56:27,200 --> 00:56:38,560
So you see, of course, THC, THCA, those are the most abundant cannabinoids in cannabis.

626
00:56:38,560 --> 00:56:47,000
But you're also seeing things like, and sorry that this graph is so bad.

627
00:56:47,000 --> 00:56:53,160
If any of you have better ideas about how to visualize this or clean these labels up,

628
00:56:53,160 --> 00:56:55,280
then by all means.

629
00:56:55,280 --> 00:57:00,800
So I'm just gonna do my best, so sorry for this really, really bad visualization.

630
00:57:00,800 --> 00:57:10,080
But you see CBGA, beta-caryophyllene, you see beta-myrcene and limonene are quite high.

631
00:57:10,080 --> 00:57:18,240
And then compounds on this end are what you'd call rare compounds.

632
00:57:18,240 --> 00:57:25,280
So pulegol and nerol are rare.

633
00:57:25,280 --> 00:57:34,440
So if you're in the business of, they call them exotics.

634
00:57:34,440 --> 00:57:42,040
So those are the people who are doing just the really rare and exotic type strains.

635
00:57:42,040 --> 00:57:51,800
And you may want to look for terpenes that are really, really rare, whatever they may

636
00:57:51,800 --> 00:57:52,800
be.

637
00:57:52,800 --> 00:57:56,880
So terpinene is actually kind of rare.

638
00:57:56,880 --> 00:58:03,520
So that's something you can think about.

639
00:58:03,520 --> 00:58:13,400
And then these are our classic charts.

640
00:58:13,400 --> 00:58:18,080
I'm actually gonna skip over those just because it kind of sidetracks from the main thing

641
00:58:18,080 --> 00:58:21,720
we're doing here.

642
00:58:21,720 --> 00:58:27,000
So actually, I'm not gonna skip over them.

643
00:58:27,000 --> 00:58:31,280
Do you all want me to, it's at 9.30.

644
00:58:31,280 --> 00:58:34,520
So do you want me to go ahead and show you all the visualizations here in like five more

645
00:58:34,520 --> 00:58:35,520
minutes or?

646
00:58:35,520 --> 00:58:36,520
I'd have to step away.

647
00:58:36,520 --> 00:58:37,520
Keegan is always huge.

648
00:58:37,520 --> 00:58:38,520
Thank you.

649
00:58:38,520 --> 00:58:39,520
Thank you, everyone.

650
00:58:39,520 --> 00:58:40,520
All right.

651
00:58:40,520 --> 00:58:41,520
Too cool, Yasha.

652
00:58:41,520 --> 00:58:42,520
Thank you for coming.

653
00:58:42,520 --> 00:58:43,520
And how about everybody else?

654
00:58:43,520 --> 00:58:52,880
Do you want to see all the charts or just want me to wrap it up real quick?

655
00:58:52,880 --> 00:58:53,880
I'll stick around.

656
00:58:53,880 --> 00:58:54,880
Okay, let's just do it.

657
00:58:54,880 --> 00:58:58,720
Yeah, I have a few more minutes to stick around and see the charts.

658
00:58:58,720 --> 00:59:00,720
I'm interested so far.

659
00:59:00,720 --> 00:59:01,720
Cool.

660
00:59:01,720 --> 00:59:02,720
Cool.

661
00:59:02,720 --> 00:59:11,240
Well, we'll just do these last few charts real quick because they're interesting enough.

662
00:59:11,240 --> 00:59:19,880
So long story short, we've looked at beta-pinene to d-limonene ratios in the past and just

663
00:59:19,880 --> 00:59:26,960
said, oh, hey, you know, different cannabis strains have different ratios of these.

664
00:59:26,960 --> 00:59:37,320
So one thing that I was kind of curious about is, you know, is this changing over time?

665
00:59:37,320 --> 00:59:44,360
And, you know, is the market changing over time?

666
00:59:44,360 --> 00:59:50,440
And basically what I was wondering is, you know, or kind of more like what we would think

667
00:59:50,440 --> 00:59:55,840
of as the indica-leaning, which I kind of do in purple.

668
00:59:55,840 --> 01:00:00,840
And basically the way I think of this is, you know, if you do a quick search online,

669
01:00:00,840 --> 01:00:11,760
you see, oh, sativa indica, like, oh, sativa plants may have like thinner leaves than,

670
01:00:11,760 --> 01:00:14,360
you know, indica plants may be shorter.

671
01:00:14,360 --> 01:00:17,360
And so there's all these sort of ad hoc characteristics.

672
01:00:17,360 --> 01:00:23,480
And I just almost think of this as just a chemical characteristic.

673
01:00:23,480 --> 01:00:31,080
So just as plants have different heights, plants have different chemical profiles.

674
01:00:31,080 --> 01:00:37,640
And so just because something has, you know, a ratio that's high in d-limonene doesn't

675
01:00:37,640 --> 01:00:46,680
necessarily mean it's an indica, but that's just a trait that's typical of what you would

676
01:00:46,680 --> 01:00:50,000
think of as your typical indica plants.

677
01:00:50,000 --> 01:00:56,200
The way I think of this is that it's more just a trait.

678
01:00:56,200 --> 01:01:01,680
And so basically I was just curious, like, you know, are we seeing, you know, more or

679
01:01:01,680 --> 01:01:07,480
less of one type of trait over the other over time?

680
01:01:07,480 --> 01:01:18,960
And you can actually conclude with some degree of significance that, you know, it looks like

681
01:01:18,960 --> 01:01:21,640
the market's kind of heading in one direction.

682
01:01:21,640 --> 01:01:27,760
But then, I don't know, I wouldn't actually read too, too much into

683
01:01:27,760 --> 01:01:32,360
this because it's kind of, it's kind of variable a little bit, right?

684
01:01:32,360 --> 01:01:39,320
Because it's, you know, up here at 0.75, it dips down, then it goes way up.

685
01:01:39,320 --> 01:01:41,200
Now it's dipping down.

686
01:01:41,200 --> 01:01:44,240
So there may actually not be anything to this trend.

687
01:01:44,240 --> 01:01:49,120
We could just, we may just need to see time play out on this one.

688
01:01:49,120 --> 01:01:54,920
I have another piece of information there that might be relevant for consideration.

689
01:01:54,920 --> 01:02:02,680
The climate changes year to year, and we know that the genetic triggers for these compounds can

690
01:02:02,680 --> 01:02:10,880
be largely associated with, or at least can be largely affected by, weather patterns

691
01:02:10,880 --> 01:02:17,560
and even indoor grows, you know, respond to changes in outdoor climate to some degree.

692
01:02:17,560 --> 01:02:28,120
So the 2021-22 year was a notably strange year for cannabis farming.

693
01:02:28,120 --> 01:02:29,240
That's a good point.

694
01:02:29,240 --> 01:02:35,160
And I think some things like that, like the climate should be taken into consideration

695
01:02:35,160 --> 01:02:38,920
because this is sort of almost like a different topic.

696
01:02:38,920 --> 01:02:44,120
This is like, that's why I almost skipped this one.

697
01:02:44,120 --> 01:02:48,680
This one's a bit of a tangent.

698
01:02:48,680 --> 01:02:56,080
When you're talking about kind of like, oh, which types are more prevalent over the years,

699
01:02:56,080 --> 01:03:01,600
but I don't know, it's kind of in the ecological realm.

700
01:03:01,600 --> 01:03:06,440
And then I don't know, kind of more into the plant chemistry realm.

701
01:03:06,440 --> 01:03:09,560
So I like that you brought that up, Caleb.

702
01:03:09,560 --> 01:03:20,160
So long story short, something like this, it's not apparent that it's just because people

703
01:03:20,160 --> 01:03:25,440
prefer the higher d-limonene.

704
01:03:25,440 --> 01:03:28,920
It could just be noise.

705
01:03:28,920 --> 01:03:33,960
But here's another chart that may actually kind of jump out at us, right?

706
01:03:33,960 --> 01:03:37,000
We want statistics to kind of jump out at you.

707
01:03:37,000 --> 01:03:53,240
So here I just calculated the, remember we talked about earlier, the Shannon index.

708
01:03:53,240 --> 01:03:57,200
At the end of the day, we want some index for diversity.

709
01:03:57,200 --> 01:04:00,880
So there's the Simpson index.

710
01:04:00,880 --> 01:04:02,920
We've looked at that in the past.

711
01:04:02,920 --> 01:04:06,040
Today we're looking at the Shannon index.

712
01:04:06,040 --> 01:04:20,880
And I want to say higher values are more diverse.

713
01:04:20,880 --> 01:04:32,360
And I think if it goes to zero, I want to say it's completely undiverse.

714
01:04:32,360 --> 01:04:41,720
But anyways, here it is.

715
01:04:41,720 --> 01:04:45,960
Sorry, my brain's at about its capacity for today.

716
01:04:45,960 --> 01:04:51,680
So I kind of encourage you all to look into this and make sure I'm doing it right.

717
01:04:51,680 --> 01:05:07,560
But basically, here I just calculated the total diversity in chemicals for each product.

718
01:05:07,560 --> 01:05:11,820
And then I took the average of that over time.

719
01:05:11,820 --> 01:05:18,520
So it's basically like, given a given product.

720
01:05:18,520 --> 01:05:30,560
So for example, this first product, I won't print out everything.

721
01:05:30,560 --> 01:05:37,480
Just the product type and its diversity.

722
01:05:37,480 --> 01:05:59,480
So for example, this first product, it just has a zero.

723
01:05:59,480 --> 01:06:05,960
And that's because it only has THCA.

724
01:06:05,960 --> 01:06:17,160
But if you look at another product, it's got a diversity score of 1.1.

725
01:06:17,160 --> 01:06:25,440
And you see it has THCA.

726
01:06:25,440 --> 01:06:28,400
It has beta-myrcene.

727
01:06:28,400 --> 01:06:30,780
It's got terpenyl.

728
01:06:30,780 --> 01:06:33,400
It's got beta-caryophyllene.

729
01:06:33,400 --> 01:06:43,600
And so basically, the more diverse the chemical constituents of the product, the higher the

730
01:06:43,600 --> 01:06:48,120
diversity score will be.
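
[Editor's sketch] The per-product diversity score described above can be illustrated with the standard Shannon index, H = -sum(p_i * ln(p_i)), where p_i is each compound's share of the total measured concentration. The function name and the concentration values below are made up for illustration and are not taken from the speaker's notebook or data.

```python
import math


def shannon_index(concentrations):
    """Shannon diversity of a list of compound concentrations.

    Zero or missing concentrations are ignored; the remaining values are
    normalized to proportions before applying H = -sum(p * ln(p)).
    """
    positive = [c for c in concentrations if c > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log(c / total) for c in positive)


# Hypothetical lab results (percent by weight), for illustration only.
only_thca = [22.0]
mixed_profile = [20.0, 0.6, 0.3, 0.2]
evenly_mixed = [1.0, 1.0, 1.0, 1.0]

print(shannon_index(only_thca))      # a single compound gives zero diversity
print(shannon_index(mixed_profile))  # several compounds, dominated by one
print(shannon_index(evenly_mixed))   # maximum for four compounds, ln(4)
```

Under this definition, a product scores higher when it has more compounds and when their concentrations are more even, so a product dominated by one compound scores lower than an evenly mixed one with the same number of compounds.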

731
01:06:48,120 --> 01:06:56,600
And so you can basically see here real quick.

732
01:06:56,600 --> 01:07:00,840
And I'll be wrapping it up.

733
01:07:00,840 --> 01:07:09,920
So you can basically see the distribution of diversity in the market, where you see

734
01:07:09,920 --> 01:07:17,500
some products with just a really high degree of chemical diversity.

735
01:07:17,500 --> 01:07:22,240
So these ones, they're just going to have a lot of different chemicals at different

736
01:07:22,240 --> 01:07:24,160
concentrations.

737
01:07:24,160 --> 01:07:36,120
And then products down here are just going to have fewer chemicals at more similar concentrations.

738
01:07:36,120 --> 01:07:45,760
And then basically, what I did was I just grouped that by month and then took the average.
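
[Editor's sketch] The grouping step described above, averaging each product's diversity score by month, might look like the following. This version uses only the Python standard library; the record layout (test date, score) and the function name are assumptions for illustration, and the speaker's own code may differ, for example by using a dataframe library.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_mean_diversity(records):
    """Average diversity scores by calendar month.

    `records` is an iterable of (test_date, diversity_score) pairs.
    Returns a dict keyed by (year, month), in chronological order.
    """
    by_month = defaultdict(list)
    for tested_on, score in records:
        by_month[(tested_on.year, tested_on.month)].append(score)
    return {month: mean(scores) for month, scores in sorted(by_month.items())}


# Made-up records for illustration only.
records = [
    (date(2021, 1, 5), 1.0),
    (date(2021, 1, 20), 0.5),
    (date(2021, 2, 3), 0.25),
]
monthly = monthly_mean_diversity(records)
for (year, month), average in monthly.items():
    print(f"{year}-{month:02d}: {average}")
```

Plotting these monthly averages against time gives the kind of trend line discussed here.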

739
01:07:45,760 --> 01:07:49,000
And it's kind of stark.

740
01:07:49,000 --> 01:07:56,800
And then so for example, here with the beta-pinene to d-limonene, this is where I kind of like

741
01:07:56,800 --> 01:08:04,000
to show you an example of a statistic that doesn't jump out at you too, too much versus

742
01:08:04,000 --> 01:08:06,900
one that really jumps out at you.

743
01:08:06,900 --> 01:08:13,460
This is like a time series that does have a statistically significant negative trend.

744
01:08:13,460 --> 01:08:22,680
But it may just be seasonality and we just haven't captured the next season yet versus

745
01:08:22,680 --> 01:08:24,400
a plot like this.

746
01:08:24,400 --> 01:08:29,240
I mean, there's just no other way to put it.

747
01:08:29,240 --> 01:08:38,280
For whatever reason, and the reason may even just be COA data is bad.

748
01:08:38,280 --> 01:08:45,240
It could just be that there was just a structural change in how the COA data was being recorded

749
01:08:45,240 --> 01:08:49,280
at these various junctures in time.

750
01:08:49,280 --> 01:09:01,720
But whatever the reason, it does appear that basically what I'm calling the chemical diversity

751
01:09:01,720 --> 01:09:12,480
of cannabis products in Connecticut has decreased drastically over time.

752
01:09:12,480 --> 01:09:22,040
And this is kind of in line with the anecdote that I had heard from Lou, which was whenever

753
01:09:22,040 --> 01:09:34,520
he basically said, whenever Connecticut allowed adult use, you just saw a decrease in just

754
01:09:34,520 --> 01:09:37,760
the variety of selection.

755
01:09:37,760 --> 01:09:45,800
So it could be, we may want to put on a historian's hat and basically pinpoint the point in time

756
01:09:45,800 --> 01:09:52,920
when Connecticut permitted adult use and see if we could do a difference in difference

757
01:09:52,920 --> 01:10:00,920
and see, oh, did the allowance of adult use decrease the chemical diversity in cannabis

758
01:10:00,920 --> 01:10:02,920
in Connecticut?

759
01:10:02,920 --> 01:10:04,280
Or maybe it didn't.

760
01:10:04,280 --> 01:10:08,080
Maybe it's just been decreasing over time for another reason.

761
01:10:08,080 --> 01:10:15,360
Once again, it could be COA issues.

762
01:10:15,360 --> 01:10:21,240
But then I just wanted to add one last conditional plot just for the fun of it.

763
01:10:21,240 --> 01:10:27,040
And this is just, remember, there's not that many producers in Connecticut.

764
01:10:27,040 --> 01:10:35,160
So I just was curious about what's the difference between the producers.

765
01:10:35,160 --> 01:10:44,200
And basically you see they're all kind of decreasing their diversity of products.

766
01:10:44,200 --> 01:10:50,640
It does look like these two companies, it looks like they were kind of holding out.

767
01:10:50,640 --> 01:10:58,040
So maybe they were during this time period trying to keep some of the diverse products

768
01:10:58,040 --> 01:11:03,320
on the shelf and then they may have just not been profitable.

769
01:11:03,320 --> 01:11:11,400
And then it looks like for better or for worse, it looks like most people are just kind of

770
01:11:11,400 --> 01:11:23,920
settling on a pretty standard set of products that don't have a wide variety of chemical

771
01:11:23,920 --> 01:11:24,920
profiles.

772
01:11:24,920 --> 01:11:32,160
And so maybe someone like Lou can't find the product they're looking for.

773
01:11:32,160 --> 01:11:46,360
But anywho, this is just a way that you can conceptualize diversity of products in cannabis.

774
01:11:46,360 --> 01:11:54,480
And I just thought it was interesting because yes, we could look at strains, but what does

775
01:11:54,480 --> 01:11:57,160
a strain name mean?

776
01:11:57,160 --> 01:12:06,840
I think it's a bit more impactful to actually look at the diversity of chemicals.

777
01:12:06,840 --> 01:12:14,160
But I think both are interesting, but hopefully you all found a nugget here.

778
01:12:14,160 --> 01:12:17,920
But any thoughts, comments, questions?

779
01:12:17,920 --> 01:12:26,360
I know I covered a ton of ground there, but hopefully you found something of interest.

780
01:12:26,360 --> 01:12:36,400
On that note, I want to give you all a big thank you for coming.

781
01:12:36,400 --> 01:12:39,080
I know it was a long meetup.

782
01:12:39,080 --> 01:12:45,800
Hopefully you walked away with some good data, some good statistics and had a little fun

783
01:12:45,800 --> 01:12:50,800
while you were there.

784
01:12:50,800 --> 01:12:53,800
As always, thank you all for coming.

785
01:12:53,800 --> 01:12:55,440
Go on, get on out of here.

786
01:12:55,440 --> 01:12:58,800
Have a productive day and keep advancing cannabis science.

