1
00:00:00,000 --> 00:00:08,120
Welcome to the Cannabis Data Science Meetup Group.

2
00:00:08,120 --> 00:00:13,460
Definitely in for a treat today, as always, going to cover a lot of data, should be some

3
00:00:13,460 --> 00:00:16,080
good research ideas for you.

4
00:00:16,080 --> 00:00:23,060
I'll talk about a couple that are really kind of picking my brain and then some ways that

5
00:00:23,060 --> 00:00:30,240
you could potentially help with this data wrangling if you're so inclined.

6
00:00:30,240 --> 00:00:35,820
But I think we're pushing some new frontier today.

7
00:00:35,820 --> 00:00:44,060
People have been asking for a lot of sales data, and I've finally wrangled north of a

8
00:00:44,060 --> 00:00:47,420
million sales receipts.

9
00:00:47,420 --> 00:00:51,000
So this is going to be good data to look at.

10
00:00:51,000 --> 00:00:59,360
I'll share with you all the trials and tribulations I went through in the data curation phase,

11
00:00:59,360 --> 00:01:05,600
which was non-trivial and would love for you to reproduce it if you wish, because that

12
00:01:05,600 --> 00:01:11,640
would add value, just to verify that, yes, you can get the same results.

13
00:01:11,640 --> 00:01:19,780
And also another set of eyes will help maybe find any improvements that could be had.

14
00:01:19,780 --> 00:01:22,900
So there's a couple ways that code can be improved upon, and I'll point that out to

15
00:01:22,900 --> 00:01:23,900
you today.

16
00:01:23,900 --> 00:01:28,320
And then, of course, we'll get to some cool statistics.

17
00:01:28,320 --> 00:01:35,480
I know we were talking about home grow last week and how that could provide an interesting

18
00:01:35,480 --> 00:01:36,700
incentive.

19
00:01:36,700 --> 00:01:44,700
But what are your thoughts before I go too much further into the data that I've

20
00:01:44,700 --> 00:01:46,480
been pawing around at?

21
00:01:46,480 --> 00:01:53,760
So long time classic, Candice, feel free to share about anything that you want, and then

22
00:01:53,760 --> 00:01:56,280
Luis, I'll get to you.

23
00:01:56,280 --> 00:02:04,560
But I owe you a worksheet from last time, and I actually penned it up, but I haven't

24
00:02:04,560 --> 00:02:05,560
shared it yet.

25
00:02:05,560 --> 00:02:11,320
So if you have any ideas there, but anything that you're working on?

26
00:02:11,320 --> 00:02:19,640
I'm still working with the private GPT on my GPUs.

27
00:02:19,640 --> 00:02:27,920
This is going to be a good project because, one, you've got the hardware, and then, two,

28
00:02:27,920 --> 00:02:37,760
I've got a sneaking suspicion that we may need a lot of kind of custom tailored code

29
00:02:37,760 --> 00:02:43,800
to kind of make the output fruitful.

30
00:02:43,800 --> 00:02:47,680
But I'll have some more thoughts to share on that momentarily.

31
00:02:47,680 --> 00:02:50,600
But how about you, Luis?

32
00:02:50,600 --> 00:02:52,760
Happy to see you at the group today.

33
00:02:52,760 --> 00:02:56,400
We'd love to hear about some of the things that you're interested in, especially, you

34
00:02:56,400 --> 00:02:58,280
know, cannabis data related.

35
00:02:58,280 --> 00:02:59,760
Thank you.

36
00:02:59,760 --> 00:03:02,680
It's good to be here.

37
00:03:02,680 --> 00:03:03,680
Good morning.

38
00:03:03,680 --> 00:03:16,880
I'm Luis Coyote, and I have been apparently an outlier in my ability to sort of successfully

39
00:03:16,880 --> 00:03:29,160
convince my state to not only make all cannabis certificate of analysis data public by default,

40
00:03:29,160 --> 00:03:33,640
but also to integrate it into the state's data portal.

41
00:03:33,640 --> 00:03:40,560
So that it no longer requires any sort of manual lever pulling in order to get access

42
00:03:40,560 --> 00:03:43,080
to that data.

43
00:03:43,080 --> 00:03:55,600
Unfortunately, access, as we know, does not come with guaranteed data quality or trustworthiness

44
00:03:55,600 --> 00:03:57,280
of data.

45
00:03:57,280 --> 00:04:04,280
So that's kind of the angle that I find myself attacking these days.

46
00:04:04,280 --> 00:04:10,080
Unfortunately it means I've been doing a lot more policy work and a lot less data science.

47
00:04:10,080 --> 00:04:16,520
You know, it's, I could come up with all the hypotheses in the world and test them all

48
00:04:16,520 --> 00:04:24,400
I want, but what good is that if I don't trust the inputs of the data into, you know, those

49
00:04:24,400 --> 00:04:27,480
regressions I'm running, into whatever else I might be doing?

50
00:04:27,480 --> 00:04:40,640
So I think for me, what I'm most interested in is the continuing efforts of folks in other

51
00:04:40,640 --> 00:04:49,400
markets, particularly Massachusetts, to achieve the same parity of, you know, data access while

52
00:04:49,400 --> 00:04:56,360
also, you know, understanding that it goes hand in glove with the quality factor.

53
00:04:56,360 --> 00:05:08,040
And I'm also becoming more interested in centralized aggregations of data from different markets

54
00:05:08,040 --> 00:05:15,280
for purposes of sort of cross-market analysis.

55
00:05:15,280 --> 00:05:20,280
And I had been meaning to get involved with this group much sooner and I'm glad to finally

56
00:05:20,280 --> 00:05:22,000
be here.

57
00:05:22,000 --> 00:05:27,600
I've admired and respected the work that both of you have done.

58
00:05:27,600 --> 00:05:35,840
And so mostly, I guess I'll mostly be a kind of a fly on the wall at least to start, but

59
00:05:35,840 --> 00:05:43,880
I really would like to start using that, you know, the data science muscles instead of

60
00:05:43,880 --> 00:05:50,800
the policy crafting muscles because I'm a little bit burnt out on the policy side of

61
00:05:50,800 --> 00:05:58,040
things at this point and I'd like to get back into what I consider to be the more fun stuff.

62
00:05:58,040 --> 00:06:02,480
So maybe that requires a little bit of suspension of disbelief on my part in terms of the quality

63
00:06:02,480 --> 00:06:06,480
of the data that I'm working with currently, but maybe through this group my eyes

64
00:06:06,480 --> 00:06:12,280
will be opened to additional sources, additional outlets, different avenues

65
00:06:12,280 --> 00:06:16,560
that those pursuits can take that are not necessarily reliant solely on my own state's

66
00:06:16,560 --> 00:06:17,560
data.

67
00:06:17,560 --> 00:06:21,680
So once again, thank you for having me and I appreciate the kindness.

68
00:06:21,680 --> 00:06:23,400
Absolutely.

69
00:06:23,400 --> 00:06:24,400
Love it, Lou.

70
00:06:24,400 --> 00:06:29,120
And you brought up a thousand and one phenomenal points.

71
00:06:29,120 --> 00:06:30,920
Where to even begin?

72
00:06:30,920 --> 00:06:38,040
And well, we'll actually be doing just that today because like you said, it's easy to

73
00:06:38,040 --> 00:06:44,480
get caught up in the weeds and the arguments of the policies, but we're data scientists

74
00:06:44,480 --> 00:06:45,480
here.

75
00:06:45,480 --> 00:06:48,680
So what is our comparative advantage?

76
00:06:48,680 --> 00:06:55,560
It's actually calculating statistics or facts.

77
00:06:55,560 --> 00:07:06,960
So it may seem irrelevant at times, but I think it adds a lot of value because, you

78
00:07:06,960 --> 00:07:13,600
know, these are statistics that people can then actually ground their, say, policy

79
00:07:13,600 --> 00:07:22,160
arguments in because that's often an easy criticism if you're arguing for one point

80
00:07:22,160 --> 00:07:26,360
and then somebody says, oh, do you have any facts to back that up?

81
00:07:26,360 --> 00:07:32,840
Well, you want to have at least a few statistics in your back pocket.

82
00:07:32,840 --> 00:07:37,000
And so that's kind of what we're going to provide today.

83
00:07:37,000 --> 00:07:42,840
And then you said, why even have access to this data?

84
00:07:42,840 --> 00:07:46,440
Alex, welcome to the group.

85
00:07:46,440 --> 00:07:53,080
Just giving a quick synopsis of why it's even important to have cannabis data.

86
00:07:53,080 --> 00:07:57,840
And then what's the point of even calculating these statistics?

87
00:07:57,840 --> 00:08:04,880
And so basically, some of the states are publishing aggregates like aggregate statistics.

88
00:08:04,880 --> 00:08:08,680
So we've seen, oh, there's total sales by month.

89
00:08:08,680 --> 00:08:11,680
And that was actually something that I was going to point out to you.

90
00:08:11,680 --> 00:08:18,760
There is a Yahoo Finance article that was talking about the amount consumed by people

91
00:08:18,760 --> 00:08:19,760
in Seattle.

92
00:08:19,760 --> 00:08:25,080
And I can get you the link right now.

93
00:08:25,080 --> 00:08:29,360
And so they're talking about the amount consumed by people in Seattle.

94
00:08:29,360 --> 00:08:37,720
And I thought, well, as good data scientists, we can basically try to replicate their statistic.

95
00:08:37,720 --> 00:08:44,800
And then similarly for the state aggregates, because Washington state publishes the total

96
00:08:44,800 --> 00:08:51,920
amount sold, the total amount of cannabis sold, and so do other states like Massachusetts.

97
00:08:51,920 --> 00:08:55,960
But you kind of have to take it at face value.

98
00:08:55,960 --> 00:09:03,000
And what's cool is in Washington state, we can do a public records request, get the entire

99
00:09:03,000 --> 00:09:06,160
population of cannabis sales.

100
00:09:06,160 --> 00:09:12,480
So every single line item on every single receipt.

101
00:09:12,480 --> 00:09:15,760
And we can validate, right?

102
00:09:15,760 --> 00:09:19,760
We were talking about the importance of, you know, verifying statistics already.

103
00:09:19,760 --> 00:09:26,440
Well, we could potentially verify the Washington state totals, right?

104
00:09:26,440 --> 00:09:30,800
So they're telling us that a certain amount is being sold per month.

105
00:09:30,800 --> 00:09:35,040
Well, we've got every single receipt.

106
00:09:35,040 --> 00:09:42,400
So it's just almost an accounting endeavor to just go through every single receipt and

107
00:09:42,400 --> 00:09:44,360
add them all up.
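
A minimal sketch of that "accounting endeavor", in Python since that is what the session runs. The field names (`sold_at`, `price`, `discount`) are hypothetical and not taken from the Washington State files, and whether discounts are subtracted is exactly the kind of assumption that goes into an aggregate:

```python
from collections import defaultdict
from decimal import Decimal


def monthly_totals(line_items):
    """Sum net line-item prices by sale month ("YYYY-MM")."""
    totals = defaultdict(Decimal)
    for item in line_items:
        month = item["sold_at"][:7]  # "2023-01-15" -> "2023-01"
        # Assumption: the net price is the price minus any discount.
        net = Decimal(item["price"]) - Decimal(item.get("discount", "0"))
        totals[month] += net
    return dict(totals)


def discrepancies(computed, published):
    """Return the months where the computed total differs from the published one."""
    return {
        month: computed.get(month, Decimal("0")) - total
        for month, total in published.items()
        if computed.get(month, Decimal("0")) != total
    }


# Toy line items standing in for the real receipts.
line_items = [
    {"sold_at": "2023-01-15", "price": "30.00", "discount": "5.00"},
    {"sold_at": "2023-01-20", "price": "12.50"},
    {"sold_at": "2023-02-01", "price": "45.00", "discount": "0"},
]
totals = monthly_totals(line_items)

# Made-up "published" figures, only to show the comparison step.
published = {"2023-01": Decimal("37.50"), "2023-02": Decimal("50.00")}
gaps = discrepancies(totals, published)
```

Any month that shows up in `gaps` is a month where either the receipts or the assumptions behind the published aggregate deserve a second look.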

108
00:09:44,360 --> 00:09:49,280
And as I'll demonstrate today, it's non-trivial.

109
00:09:49,280 --> 00:09:55,520
So whenever something's non-trivial in statistics, you kind of want people to double check your

110
00:09:55,520 --> 00:09:56,520
work.

111
00:09:56,520 --> 00:10:00,720
And so that's, I think, going to be the big argument today for, say, states like Massachusetts

112
00:10:00,720 --> 00:10:08,800
is it's nice for them to publish, say, summary statistics.

113
00:10:08,800 --> 00:10:13,840
But you really, really want to see the raw data because there's so many assumptions that

114
00:10:13,840 --> 00:10:19,840
go into calculating aggregates.

115
00:10:19,840 --> 00:10:20,840
So it's worthwhile.

116
00:10:20,840 --> 00:10:32,120
And while I find this article, Alex, would you want to say a word for yourself?

117
00:10:32,120 --> 00:10:34,240
Maybe what brought you to the group?

118
00:10:34,240 --> 00:10:41,800
And maybe, you know, what do you hope to get out of the marrying of cannabis plus data

119
00:10:41,800 --> 00:10:44,560
science?

120
00:10:44,560 --> 00:10:45,560
Thanks for having me.

121
00:10:45,560 --> 00:10:46,560
Sorry for running behind.

122
00:10:46,560 --> 00:10:52,520
But, yeah, I've seen your posts on LinkedIn and I love cannabis and I love data.

123
00:10:52,520 --> 00:10:58,000
I'm a business intelligence student, online student, and just hoping to learn.

124
00:10:58,000 --> 00:10:59,920
I'll follow along the parts I can follow.

125
00:10:59,920 --> 00:11:05,120
And if it's above my level, I'll just listen and figure it out next time, hopefully.

126
00:11:05,120 --> 00:11:08,120
Thank you.

127
00:11:08,120 --> 00:11:16,160
Phenomenal and we cover the gamut as far as cannabis data goes.

128
00:11:16,160 --> 00:11:18,880
So I started out working at a laboratory.

129
00:11:18,880 --> 00:11:26,280
So my heart is a lot in the chemistry side of the cannabis industry.

130
00:11:26,280 --> 00:11:34,520
So I love learning more about, you know, what the chemical constituents of the

131
00:11:34,520 --> 00:11:40,280
cannabis plant actually are, and then I'm becoming increasingly interested in the agriculture side.

132
00:11:40,280 --> 00:11:44,880
So how do people actually cultivate and grow the plant?

133
00:11:44,880 --> 00:11:49,400
Actually increasingly interested in hemp because it's still the cannabis plant at the end of

134
00:11:49,400 --> 00:11:51,760
the day.

135
00:11:51,760 --> 00:11:55,760
The processing side is the part I know the least about.

136
00:11:55,760 --> 00:11:57,520
And then retail.

137
00:11:57,520 --> 00:12:02,200
I come at this from an economics point of view.

138
00:12:02,200 --> 00:12:08,800
So sometimes I maybe get too lost in the weeds when we really start talking about sales and

139
00:12:08,800 --> 00:12:10,680
market structure and things like that.

140
00:12:10,680 --> 00:12:12,680
But some people find it interesting.

141
00:12:12,680 --> 00:12:17,400
So tons of people find different things of interest.

142
00:12:17,400 --> 00:12:25,480
But one thing I love to do is see sort of what people are talking about in various news

143
00:12:25,480 --> 00:12:28,360
outlets and in the mainstream.

144
00:12:28,360 --> 00:12:33,240
Like I said, try to replicate some of their statistics sometimes, always for fun.

145
00:12:33,240 --> 00:12:36,760
But welcome to the group, John.

146
00:12:36,760 --> 00:12:42,720
We'd love to hear what's your angle and what do you hope to get out of the group?

147
00:12:42,720 --> 00:12:43,720
Yeah.

148
00:12:43,720 --> 00:12:45,840
So hi, all.

149
00:12:45,840 --> 00:12:51,920
So I'm with a company called Genetica and we do cannabis strain matching.

150
00:12:51,920 --> 00:12:55,720
So matching people's effects to the strains.

151
00:12:55,720 --> 00:13:04,160
And I've been watching your work for a while now and I just really love your open source

152
00:13:04,160 --> 00:13:07,680
tools and the videos and the code.

153
00:13:07,680 --> 00:13:17,200
And it helps provide data sources that we can leverage to really understand, to your

154
00:13:17,200 --> 00:13:24,100
point, the science behind the mix of like, what are all these cannabinoids and terpenes

155
00:13:24,100 --> 00:13:29,000
and how much do they actually influence the total effect that somebody feels?

156
00:13:29,000 --> 00:13:31,640
And doing that on a granular level.

157
00:13:31,640 --> 00:13:37,160
And there's a lot of data out there about effects to a strain, but like not down to

158
00:13:37,160 --> 00:13:38,920
the COA level.

159
00:13:38,920 --> 00:13:46,560
So really trying to connect a batch to a review and to a strain and really understand all

160
00:13:46,560 --> 00:13:53,160
the causal factors to, you know, we work with understanding and we work with companies where

161
00:13:53,160 --> 00:14:00,460
we can look at the DNA of people and of the strain to make a match to that level as well.

162
00:14:00,460 --> 00:14:05,960
So that's really kind of where my head's at.

163
00:14:05,960 --> 00:14:11,040
Right, there's retail and the things that make you money and do stuff like that.

164
00:14:11,040 --> 00:14:17,080
But my passion's in the science and really understanding the data and what's behind it.

165
00:14:17,080 --> 00:14:19,960
It's great to be here.

166
00:14:19,960 --> 00:14:26,280
I absolutely love it and you've definitely got some big things coming in your space because

167
00:14:26,280 --> 00:14:33,300
I think what's held back a lot of the research is there is not a lot of funding for things

168
00:14:33,300 --> 00:14:38,600
like clinical trials because none of the...

169
00:14:38,600 --> 00:14:45,080
Actually, I don't want to say none because there are some people at academic institutes

170
00:14:45,080 --> 00:14:50,400
doing various types of research, but I don't know how much and how much funding they have.

171
00:14:50,400 --> 00:14:57,160
But a lot of the academic institutions shy away from cannabis research because they don't

172
00:14:57,160 --> 00:15:05,120
want to get their funding stripped away and probably some of the nonprofits, too, for

173
00:15:05,120 --> 00:15:07,000
that matter.

174
00:15:07,000 --> 00:15:13,240
So I love the work that you're doing, and that's really going to be an exciting frontier

175
00:15:13,240 --> 00:15:19,840
and you're going to end up taking it so much further than anything that I could do.

176
00:15:19,840 --> 00:15:22,640
Because like I said, I studied economics.

177
00:15:22,640 --> 00:15:28,400
I wish I had studied chemistry or I still can and still trying, but it's a difficult

178
00:15:28,400 --> 00:15:29,400
subject.

179
00:15:29,400 --> 00:15:34,460
And then, like you said, then you're trying to get into the biochemistry.

180
00:15:34,460 --> 00:15:43,160
So now you're going to need to know all about the human body and extensive biology.

181
00:15:43,160 --> 00:15:51,800
And so it's a complicated subject, but I love your work.

182
00:15:51,800 --> 00:15:57,960
Yeah, no, there's no end to the learning that you can do.

183
00:15:57,960 --> 00:16:04,560
And there's always something more that you can, I guess, kind of use to sharpen your

184
00:16:04,560 --> 00:16:08,040
sword because there's a lot of angles to it.

185
00:16:08,040 --> 00:16:15,520
And everything from social determinants of health to like the mental psychology aspects.

186
00:16:15,520 --> 00:16:18,680
I mean, there's a lot of legs where it could go.

187
00:16:18,680 --> 00:16:20,960
And it's helping people.

188
00:16:20,960 --> 00:16:24,240
So that's what I'm excited about.

189
00:16:24,240 --> 00:16:25,240
Love it, John.

190
00:16:25,240 --> 00:16:26,240
Keep up the good work.

191
00:16:26,240 --> 00:16:32,360
And you're always welcome to use the Cannabis Data Science Meetup as sort of a platform.

192
00:16:32,360 --> 00:16:38,800
So a lot of times I'll talk about my latest research ideas or some of the things that

193
00:16:38,800 --> 00:16:40,080
I'm tinkering on.

194
00:16:40,080 --> 00:16:45,520
And you're always welcome to, you know, you're even welcome to prepare a presentation and

195
00:16:45,520 --> 00:16:47,240
come take the floor.

196
00:16:47,240 --> 00:16:49,280
So I love that.

197
00:16:49,280 --> 00:16:50,960
Yeah, absolutely.

198
00:16:50,960 --> 00:16:58,000
Well, Jatin, just let me know if I mispronounce your name for any reason.

199
00:16:58,000 --> 00:17:00,320
But welcome to the group.

200
00:17:00,320 --> 00:17:02,200
Happy to have you here.

201
00:17:02,200 --> 00:17:08,560
Would love to hear your angle and what you hope to walk away with at the end of the day

202
00:17:08,560 --> 00:17:10,840
from the Cannabis Data Science Meetup?

203
00:17:10,840 --> 00:17:13,920
Hi, morning, everybody.

204
00:17:13,920 --> 00:17:15,940
I completely agree with you.

205
00:17:15,940 --> 00:17:23,960
So I'm a PhD student and I haven't seen any data or analytics on cannabis.

206
00:17:23,960 --> 00:17:27,960
And I was intrigued when I saw this meetup and I was like, that's the kind of meetup

207
00:17:27,960 --> 00:17:35,080
that I would definitely want to go for. Academia always shies away from these kinds of topics.

208
00:17:35,080 --> 00:17:36,720
And I'm not sure.

209
00:17:36,720 --> 00:17:42,760
And even I try to talk to professors about this, like, why do we go away from these topics

210
00:17:42,760 --> 00:17:47,160
or even like with different, I go to different meetups as well.

211
00:17:47,160 --> 00:17:51,120
So I generally talk to people in health institutions as well.

212
00:17:51,120 --> 00:17:52,120
Why do we go away?

213
00:17:52,120 --> 00:17:53,900
Why don't we talk about this?

214
00:17:53,900 --> 00:17:56,840
Why is there not much research?

215
00:17:56,840 --> 00:17:58,880
So I haven't been getting those kinds of answers.

216
00:17:58,880 --> 00:18:00,920
So I was like, so forget it.

217
00:18:00,920 --> 00:18:03,960
If I'm not going to get answers from academia, I'll just make some of mine.

218
00:18:03,960 --> 00:18:06,160
I'll just go and do some work on my own.

219
00:18:06,160 --> 00:18:10,480
So that's why I was like, OK, it should be a good meetup.

220
00:18:10,480 --> 00:18:11,480
I love it.

221
00:18:11,480 --> 00:18:14,480
That's the true academic spirit, right?

222
00:18:14,480 --> 00:18:18,880
Just, you know, go where no one's been willing to go before.

223
00:18:18,880 --> 00:18:19,880
Exactly.

224
00:18:19,880 --> 00:18:25,840
Well, as promised, I'll get some good data in your hands.

225
00:18:25,840 --> 00:18:36,280
So here's basically a rich gold vein of data that we keep mining.

226
00:18:36,280 --> 00:18:39,400
So this is just the Washington state data.

227
00:18:39,400 --> 00:18:49,920
But this is a good demonstration of how, you know, if you keep at it, exactly, then it's

228
00:18:49,920 --> 00:18:53,920
kind of remarkable how much data is really there.

229
00:18:53,920 --> 00:19:02,120
And in fact, if you're interested, maybe I can share my screen with all of you and kind

230
00:19:02,120 --> 00:19:08,320
of just give a quick demo of what we've been able to pull off with this data, maybe a 15,

231
00:19:08,320 --> 00:19:11,240
20 minute demo, and then we can talk about it afterwards.

232
00:19:11,240 --> 00:19:12,240
Sure, sure.

233
00:19:12,240 --> 00:19:13,240
Go ahead.

234
00:19:13,240 --> 00:19:14,240
OK.

235
00:19:14,240 --> 00:19:15,240
So.

236
00:19:15,240 --> 00:19:24,360
It looks like.

237
00:19:24,360 --> 00:19:33,160
OK, so if for any reason you can't see my screen, you know, just let me know.

238
00:19:33,160 --> 00:19:37,640
It looks like they kind of toggle the user interface a little.

239
00:19:37,640 --> 00:19:38,640
OK.

240
00:19:38,640 --> 00:19:42,720
Here, I'll just start up something new here.

241
00:19:42,720 --> 00:19:45,600
OK, so what are we even working with here?

242
00:19:45,600 --> 00:19:54,320
So if you do a public records request to the Washington State Liquor and Cannabis Board,

243
00:19:54,320 --> 00:19:57,840
just let me know if you need to see the actual language.

244
00:19:57,840 --> 00:20:06,340
But many people do this request, so they have a nice link prepared and they'll give you

245
00:20:06,340 --> 00:20:09,280
a nice zip file.

246
00:20:09,280 --> 00:20:18,400
You know, this one is the latest, you know, seven gigabytes zipped and then many, many

247
00:20:18,400 --> 00:20:27,040
gigabytes, maybe north of 60, so 63 gigabytes of data here.

248
00:20:27,040 --> 00:20:31,400
And we've done many cool things here, like looking at strain names.

249
00:20:31,400 --> 00:20:38,920
We've done a lot of analysis of lab results, so there's a lot there.

250
00:20:38,920 --> 00:20:47,360
Just to be frank, one of the hardest times I'm having right now is basically.

251
00:20:47,360 --> 00:20:52,240
And I'll post all this code today and point you in the direction of it.

252
00:20:52,240 --> 00:20:55,680
But we're basically trying to.

253
00:20:55,680 --> 00:21:04,080
Actually, I don't think I've created a diagram yet of a actually.

254
00:21:04,080 --> 00:21:07,640
Yeah, we have a diagram here.

255
00:21:07,640 --> 00:21:14,000
I don't know if I'm going to be able to find it.

256
00:21:14,000 --> 00:21:21,320
So yeah, since I won't be able to find the diagram here, I'll just point you.

257
00:21:21,320 --> 00:21:23,120
I'll share it with you afterwards.

258
00:21:23,120 --> 00:21:28,120
But we've got.

259
00:21:28,120 --> 00:21:31,200
We looked at the low hanging fruit, right?

260
00:21:31,200 --> 00:21:35,440
Looking at licensee details, lab results, strains.

261
00:21:35,440 --> 00:21:40,520
And I finally wanted to get around to sales and so just wanted to be frank with you about

262
00:21:40,520 --> 00:21:47,440
like all the difficulties of putting these sales items together.

263
00:21:47,440 --> 00:21:54,120
That way, maybe you can have some insights and do a better job than I.

264
00:21:54,120 --> 00:21:56,680
Can I interrupt you just for a second?

265
00:21:56,680 --> 00:21:57,680
Yes, please.

266
00:21:57,680 --> 00:22:04,520
So this might be a silly question and I apologize if so, but I'm just trying to understand in

267
00:22:04,520 --> 00:22:13,040
terms of the tooling, I see you've got VS Code connected to CDS.

268
00:22:13,040 --> 00:22:23,080
So I'm just wondering if you could like, you know, go through like how you set up that

269
00:22:23,080 --> 00:22:28,600
within VS Code.

270
00:22:28,600 --> 00:22:32,800
Set up this interactive terminal.

271
00:22:32,800 --> 00:22:35,080
Yeah, yeah.

272
00:22:35,080 --> 00:22:36,080
OK.

273
00:22:36,080 --> 00:22:40,560
Or if it's if that's out of scope for what we're talking about right now, if you push

274
00:22:40,560 --> 00:22:45,520
control shift and P.

275
00:22:45,520 --> 00:22:52,120
That once again, I may have toggled my VS Code, so if this doesn't look like how yours

276
00:22:52,120 --> 00:22:57,080
does, then you know, feel free to email me afterwards and we can make sure to set up

277
00:22:57,080 --> 00:22:58,080
your environment.

278
00:22:58,080 --> 00:23:05,560
But the way I go about running code is I make a Jupyter window so that you can type.

279
00:23:05,560 --> 00:23:09,240
I think you can just type up here Jupyter.

280
00:23:09,240 --> 00:23:13,880
So these are Jupyter notebooks. OK, all right.

281
00:23:13,880 --> 00:23:14,880
Exactly.

282
00:23:14,880 --> 00:23:16,840
That was the piece I was missing there.

283
00:23:16,840 --> 00:23:17,840
Thank you.

284
00:23:17,840 --> 00:23:28,040
So that's just kind of how I like to play around with code and develop just to.

285
00:23:28,040 --> 00:23:30,560
You'll see my process here momentarily.

286
00:23:30,560 --> 00:23:36,360
I just kind of sometimes like to.

287
00:23:36,360 --> 00:23:39,960
Write code on the fly, but.

288
00:23:39,960 --> 00:23:46,800
You can just create an interactive window, so that should be one of your commands.

289
00:23:46,800 --> 00:23:51,520
And then this will connect to your.

290
00:23:51,520 --> 00:23:54,760
Your Python environment in here.

291
00:23:54,760 --> 00:23:59,640
I've actually set up a.

292
00:23:59,640 --> 00:24:04,480
A virtual environment where.

293
00:24:04,480 --> 00:24:11,520
Which actually may not necessarily be necessary for the code we're running today, but.

294
00:24:11,520 --> 00:24:14,760
But you can set up a virtual environment.

295
00:24:14,760 --> 00:24:18,440
Yeah, I didn't want to derail the conversation.

296
00:24:18,440 --> 00:24:22,800
Maybe beyond the scope for today, but like I said, if you need any help getting the code

297
00:24:22,800 --> 00:24:27,360
to run, I'm more than happy to help with that, especially happy to help with that.

298
00:24:27,360 --> 00:24:29,280
In fact, awesome, thank you.

299
00:24:29,280 --> 00:24:35,960
So but that's a big ugly piece of code, and so this is a smaller script.

300
00:24:35,960 --> 00:24:40,440
And so basically.

301
00:24:40,440 --> 00:24:42,440
I've gone through.

302
00:24:42,440 --> 00:24:49,640
Yes, this is what we're running here and just started to tie together all of these data

303
00:24:49,640 --> 00:24:50,640
sets.

304
00:24:50,640 --> 00:25:01,080
So we've got, basically, every line item on the receipt is called a sales detail.

305
00:25:01,080 --> 00:25:04,720
So if you actually look at that.

306
00:25:04,720 --> 00:25:06,760
So.

307
00:25:06,760 --> 00:25:12,880
Look at the latest one.

308
00:25:12,880 --> 00:25:18,600
And keep in mind, this is basically what they give you, a bunch of their

309
00:25:18,600 --> 00:25:21,000
data sets.

310
00:25:21,000 --> 00:25:29,960
So there's, you know, 102 sales details.

311
00:25:29,960 --> 00:25:35,000
And each one is large, so I'll show you what this looks like.

312
00:25:35,000 --> 00:25:43,600
And basically what the script I've written does is it just iterates through all of the

313
00:25:43,600 --> 00:25:51,840
details and just matches it with all of the pertinent details that are necessary.

314
00:25:51,840 --> 00:25:59,400
Because if you look at this, this is what they give you for a sales receipt.

315
00:25:59,400 --> 00:26:03,800
So they just give you the ID.

316
00:26:03,800 --> 00:26:07,520
Then you know you have the inventory ID.

317
00:26:07,520 --> 00:26:09,700
Which is critical.

318
00:26:09,700 --> 00:26:17,200
And then, you know, they tell you how much it sold for, any discount.

319
00:26:17,200 --> 00:26:19,960
And then the sales tax.

320
00:26:19,960 --> 00:26:24,440
And I think this is the excise tax.

321
00:26:24,440 --> 00:26:28,720
So those are the mission critical pieces here.

322
00:26:28,720 --> 00:26:32,480
But you know what are people always interested in?

323
00:26:32,480 --> 00:26:42,040
They want to know how much flower sold or how many edibles were sold or which

324
00:26:42,040 --> 00:26:45,960
producer has the highest sales.

325
00:26:45,960 --> 00:26:53,000
And so the problem we have now is that information is not here.

326
00:26:53,000 --> 00:26:59,600
But if you kind of look at how the data is structured, oh, like you can match the sales

327
00:26:59,600 --> 00:27:03,960
detail with the header.

328
00:27:03,960 --> 00:27:08,840
And so the header is basically all the information on the receipt.

329
00:27:08,840 --> 00:27:12,440
So you can think of this as a receipt.

330
00:27:12,440 --> 00:27:18,720
And each one of these is a line item on the receipt.

331
00:27:18,720 --> 00:27:21,840
Okay so that's you know one match you have to make.

332
00:27:21,840 --> 00:27:29,560
And then, you know, you find out, okay, who actually sold this item.

333
00:27:29,560 --> 00:27:30,560
Wonderful.

334
00:27:30,560 --> 00:27:41,880
And then, you know, you have to match the inventory with the inventory item by the inventory

335
00:27:41,880 --> 00:27:44,960
ID.

336
00:27:44,960 --> 00:27:48,400
And then it's just the big game of matching.

337
00:27:48,400 --> 00:27:55,120
Because then maybe you're interested in knowing what was the most popular strain that sold

338
00:27:55,120 --> 00:27:57,080
in Washington state.

339
00:27:57,080 --> 00:28:05,960
Well now you've got to match the strain ID with the strain data to get the strain name.

340
00:28:05,960 --> 00:28:12,720
So it's just this really kind of convoluted matching game that we have to play.
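
A minimal sketch of that matching game in plain Python: sales detail to header (who sold it), detail to inventory item (what it was), inventory item to strain (the strain name). Every field name below (`header_id`, `inventory_id`, `strain_id`, `licensee_id`, `product_type`, `name`) is a hypothetical stand-in, not the actual column name in the Washington State files:

```python
def enrich_sale_details(details, headers, inventory, strains):
    """Attach seller, product type, and strain name to each sales line item."""
    # Build lookup tables keyed by ID so each match is a dictionary lookup.
    headers_by_id = {h["header_id"]: h for h in headers}
    inventory_by_id = {i["inventory_id"]: i for i in inventory}
    strains_by_id = {s["strain_id"]: s for s in strains}

    enriched = []
    for detail in details:
        # A missing match yields an empty dict, so the fields come back as None.
        header = headers_by_id.get(detail["header_id"], {})
        item = inventory_by_id.get(detail["inventory_id"], {})
        strain = strains_by_id.get(item.get("strain_id"), {})
        enriched.append({
            **detail,
            "licensee_id": header.get("licensee_id"),
            "product_type": item.get("product_type"),
            "strain_name": strain.get("name"),
        })
    return enriched


# Toy records standing in for the real data sets.
details = [
    {"detail_id": 1, "header_id": 10, "inventory_id": 100, "price": "30.00"},
    {"detail_id": 2, "header_id": 10, "inventory_id": 999, "price": "12.00"},
]
headers = [{"header_id": 10, "licensee_id": "retailer-A"}]
inventory = [{"inventory_id": 100, "strain_id": 7, "product_type": "flower"}]
strains = [{"strain_id": 7, "name": "Blue Dream"}]

rows = enrich_sale_details(details, headers, inventory, strains)
```

On the full data, the same chain could be written as a series of left joins (for example with pandas `DataFrame.merge(..., how="left")`), which keeps every line item even when a match is missing.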

341
00:28:12,720 --> 00:28:23,760
And basically the one thing that I cannot get matched is the lab results. From my understanding,

342
00:28:23,760 --> 00:28:33,480
and from what we've done, it sure appears that you can match the lab results to a subset

343
00:28:33,480 --> 00:28:36,640
of the inventory.

344
00:28:36,640 --> 00:28:40,360
Which makes sense.

345
00:28:40,360 --> 00:28:44,340
People are getting the end products tested.

346
00:28:44,340 --> 00:28:48,160
So any intermediary product may not get tested.

347
00:28:48,160 --> 00:28:55,120
But the idea is they test the end product and then maybe that gets sold.

348
00:28:55,120 --> 00:29:02,080
But I haven't quite figured it out. From my understanding, this actually may be the

349
00:29:02,080 --> 00:29:08,040
problem, and this is kind of how Washington state licensees think about it.

350
00:29:08,040 --> 00:29:15,040
I think how they think about it is they send in what's called a parent lot to get

351
00:29:15,040 --> 00:29:16,160
tested.

352
00:29:16,160 --> 00:29:18,400
So that's one inventory item.

353
00:29:18,400 --> 00:29:24,040
And then I think they then split it up into a bunch of children lots.

354
00:29:24,040 --> 00:29:30,080
And so it's basically like they send in an inventory item representing a five pound lot

355
00:29:30,080 --> 00:29:32,280
or what have you.

356
00:29:32,280 --> 00:29:38,360
And then that comes back and then they then divvy that up into a bunch of different inventory

357
00:29:38,360 --> 00:29:40,240
items maybe.

358
00:29:40,240 --> 00:29:43,200
They then go off and get sold.

359
00:29:43,200 --> 00:29:47,080
And maybe that's where the link is breaking.

360
00:29:47,080 --> 00:29:53,600
But but basically you've got an inventory ID here.

361
00:29:53,600 --> 00:29:58,480
And so if you look at the lab results there's an inventory ID there.

362
00:29:58,480 --> 00:30:07,560
So one would think that you could match lab results to inventory items and then consequently

363
00:30:07,560 --> 00:30:11,400
match that with sales.

364
00:30:11,400 --> 00:30:13,920
Did I try everything under the sun?

365
00:30:13,920 --> 00:30:16,440
Well, maybe I haven't tried everything under the sun.

366
00:30:16,440 --> 00:30:22,080
Like I said I think there's still something that you could think of here where I think

367
00:30:22,080 --> 00:30:30,520
there's you have to almost match the parent inventory with the child inventory or there's

368
00:30:30,520 --> 00:30:35,480
another data set here products.

369
00:30:35,480 --> 00:30:39,600
I haven't quite figured out how that fits in the mix.

370
00:30:39,600 --> 00:30:48,120
But that's really what's needed because when you buy a product at the store you get the

371
00:30:48,120 --> 00:30:56,880
receipt and then on the label they have you know the THC and CBD percentage.

372
00:30:56,880 --> 00:31:06,360
So for us to get a complete observation here right we need to know the unit price and then

373
00:31:06,360 --> 00:31:14,400
we also have to know the like the THC and CBD content.

374
00:31:14,400 --> 00:31:17,480
And so I'll show you how far I was able to take this.

375
00:31:17,480 --> 00:31:23,600
And so basically I've just been what I call augmenting all of these.

376
00:31:23,600 --> 00:31:33,240
And so right if you just look at this same same data set but here I've augmented every

377
00:31:33,240 --> 00:31:39,320
data point that I can.

378
00:31:39,320 --> 00:31:45,760
So once again other people may have different approaches.

379
00:31:45,760 --> 00:31:52,520
So if you're an SQL wizard then you may actually have better luck just putting all of this

380
00:31:52,520 --> 00:32:03,480
in an SQL database and you know taking it from there.

381
00:32:03,480 --> 00:32:18,180
And in fact some of the so you see some of these actually most of these are incomplete.

382
00:32:18,180 --> 00:32:25,800
So for example you know like out of all of these sales I was only really able to map

383
00:32:25,800 --> 00:32:34,960
like match back you know a small percentage to strain.

384
00:32:34,960 --> 00:32:41,200
And it you know theoretically you know there you know theoretically every single one of

385
00:32:41,200 --> 00:32:50,800
these should should map back to a complete observation.

386
00:32:50,800 --> 00:33:00,280
So what I'm going to basically show you today is just a tiny tiny subset of what I've been

387
00:33:00,280 --> 00:33:03,440
able to wrangle.

388
00:33:03,440 --> 00:33:12,480
So here let me open one more of these just to see if maybe maybe and maybe for some reason

389
00:33:12,480 --> 00:33:18,320
that tail end one had a low match rate.

390
00:33:18,320 --> 00:33:22,480
But maybe not; maybe there are just low matches all around.

391
00:33:22,480 --> 00:33:37,960
So let me open one more and then I'll share with you the analysis.

392
00:33:37,960 --> 00:33:43,680
Because basically this is the so it looks like it's low matching all across the board.

393
00:33:43,680 --> 00:33:50,920
So this is basically the final frontier, or, you know, the final hurdle

394
00:33:50,920 --> 00:34:00,720
that we have to to get over here is is there any way to you know completely flesh out this

395
00:34:00,720 --> 00:34:03,960
data set here.

396
00:34:03,960 --> 00:34:06,440
I didn't realize how sparse it was.

397
00:34:06,440 --> 00:34:14,280
So basically I've you know been running this script and this is sort of the one of the

398
00:34:14,280 --> 00:34:19,720
first times I've really done an in-depth look, and wow, we're really missing

399
00:34:19,720 --> 00:34:25,600
a ton of licensees here.

400
00:34:25,600 --> 00:34:34,640
So that's unfortunate but the jar half full view is you know the ones that we are able

401
00:34:34,640 --> 00:34:36,880
to match back.

402
00:34:36,880 --> 00:34:41,600
Well check this out we can actually start doing some some cool things here.

403
00:34:41,600 --> 00:34:48,640
So so here's just just a random licensee.

404
00:34:48,640 --> 00:34:57,640
So basically we can just start you know looking at you know all the you know the sales that

405
00:34:57,640 --> 00:35:07,760
actually that one is new.

406
00:35:07,760 --> 00:35:09,920
So this one looks a bit more complete.

407
00:35:09,920 --> 00:35:12,480
So one two three four five six seven.

408
00:35:12,480 --> 00:35:23,200
So so June should have been the last full month and you know so this is you know a bit

409
00:35:23,200 --> 00:35:31,460
better of data you know if you just sort of look at you know a single licensee.

410
00:35:31,460 --> 00:35:45,360
So maybe for however this licensee operates maybe they don't do the the parent child relationship

411
00:35:45,360 --> 00:35:47,220
with their inventory.

412
00:35:47,220 --> 00:35:53,320
Maybe for whatever reason the way this licensee likes to operate once again I'm conjecturing.

413
00:35:53,320 --> 00:35:59,080
Maybe they just like just a pure one to one relationship all the way through.

414
00:35:59,080 --> 00:36:06,640
Who knows or maybe we're only capturing a tiny subset of sales for this licensee.

415
00:36:06,640 --> 00:36:11,560
But basically we have a benchmark right.

416
00:36:11,560 --> 00:36:18,960
We have the total sales published by the Washington State Liquor and Cannabis Board.

417
00:36:18,960 --> 00:36:27,060
So basically you know the idea here is you know can we just basically go through sum

418
00:36:27,060 --> 00:36:34,480
up all the sales and compare it to their total.
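
A minimal sketch of that benchmark check, assuming the curated line items carry a month and a price. The benchmark figures below are placeholders for illustration, not numbers published by the Washington State Liquor and Cannabis Board.

```python
from collections import defaultdict

# Hypothetical curated line items: (month, licensee_id, price).
items = [
    ("2023-01", "lic-1", 10.0),
    ("2023-01", "lic-2", 15.0),
    ("2023-02", "lic-1", 20.0),
]
# Placeholder benchmark totals, NOT real published figures.
published_totals = {"2023-01": 100.0, "2023-02": 80.0}

# Sum the curated sales by month.
curated_totals = defaultdict(float)
for month, _licensee, price in items:
    curated_totals[month] += price

# Share of the published total that the curated data accounts for.
coverage = {
    month: curated_totals[month] / total
    for month, total in published_totals.items()
}
for month, share in sorted(coverage.items()):
    print(f"{month}: curated data covers {share:.0%} of the benchmark")
```

A coverage well below 100% would suggest that many sales are still unmatched, which is the concern raised here.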

419
00:36:34,480 --> 00:36:43,880
But let's just see here real quick if there's anything of interest here.

420
00:36:43,880 --> 00:36:55,280
So this looks like this was Firehouse in Ellensburg and you know lots of good data there.

421
00:36:55,280 --> 00:37:01,280
And like I said this could potentially provide you know an ambitious data scientist with

422
00:37:01,280 --> 00:37:08,600
you know an endless amount of work right because you know maybe you can dive into and once

423
00:37:08,600 --> 00:37:14,360
again this may not be their complete amount of sales but you can start you know doing

424
00:37:14,360 --> 00:37:18,800
analysis you know on a retail by retail basis.

425
00:37:18,800 --> 00:37:23,120
And I've always you know had the grand idea of oh you know wouldn't it be cool to you

426
00:37:23,120 --> 00:37:31,240
know reach out to some of these retailers if you find you know some I don't know you

427
00:37:31,240 --> 00:37:39,080
know quirk in their data that you know you can say like oh maybe there is a you're you

428
00:37:39,080 --> 00:37:46,280
know you're selling a lot more liquid edibles than everybody else at a lower price or you

429
00:37:46,280 --> 00:37:53,360
know or maybe you're not selling as many liquid edibles as everybody else or you know who

430
00:37:53,360 --> 00:38:03,400
knows that's you know for a data scientist to look at their questions at hand.

431
00:38:03,400 --> 00:38:13,920
Okay so that's the preface that this is still messy work in progress but the idea is once

432
00:38:13,920 --> 00:38:23,440
you have all of these curated and you know here I'm just going to read in all of the

433
00:38:23,440 --> 00:38:26,000
the licensee sales.

434
00:38:26,000 --> 00:38:35,800
So these are just all the sales that are you know fairly well identified from receipt to

435
00:38:35,800 --> 00:38:39,960
to seller and to inventory.

436
00:38:39,960 --> 00:38:45,760
So I'll just start reading these in and so this could take a hot second.

437
00:38:45,760 --> 00:38:49,200
Oops need some packages.

438
00:38:49,200 --> 00:38:57,840
So I'll just be reading this in here and then the classic like the baking show where you

439
00:38:57,840 --> 00:39:04,320
know you already have everything baked.

440
00:39:04,320 --> 00:39:08,200
Here I've already read it in all of the data.

441
00:39:08,200 --> 00:39:10,160
So here it is.

442
00:39:10,160 --> 00:39:14,400
It should start chugging along here in a second but like I said it'll take about

443
00:39:14,400 --> 00:39:16,400
a minute or two.

444
00:39:16,400 --> 00:39:23,920
But and once again just to show you what this looks like you can basically just take a random

445
00:39:23,920 --> 00:39:30,660
sample and just look at one of these observations here.

446
00:39:30,660 --> 00:39:43,320
So this was usable cannabis sold at Zips cannabis in Tacoma.

447
00:39:43,320 --> 00:39:55,080
It was sold on January 30th, 2023.

448
00:39:55,080 --> 00:40:05,600
And it was this lemon cherry gelato pre-roll you know and then you you've got the strain

449
00:40:05,600 --> 00:40:09,600
lemon cherry gelato.

450
00:40:09,600 --> 00:40:15,960
It looks like this may have cost Zips maybe 450.

451
00:40:15,960 --> 00:40:22,000
Once again I wouldn't read too much into these numbers and then Lou this is where you talked

452
00:40:22,000 --> 00:40:25,800
about data quality.

453
00:40:25,800 --> 00:40:38,880
You have to be super careful about your insights because you can't rely too too much on or

454
00:40:38,880 --> 00:40:45,720
maybe that's sort of the art where data science goes is maybe you may have to just exclude

455
00:40:45,720 --> 00:40:57,080
any licensees who aren't accurately recording their costs or this or that because you see

456
00:40:57,080 --> 00:40:58,640
wild things here right.

457
00:40:58,640 --> 00:41:01,480
You see negative number right.

458
00:41:01,480 --> 00:41:13,760
So for example we could just say I wonder if we can I wonder how this works.

459
00:41:13,760 --> 00:41:18,600
Something else worth mentioning is that it might be one of these, like, steel sharpens

460
00:41:18,600 --> 00:41:22,740
steel scenarios with you know the overall data set.

461
00:41:22,740 --> 00:41:30,400
You may be able to use some of the data to expose quality issues with other parts of

462
00:41:30,400 --> 00:41:32,520
the data.

463
00:41:32,520 --> 00:41:34,960
So that's always worth keeping in mind.

464
00:41:34,960 --> 00:41:35,960
Exactly.

465
00:41:35,960 --> 00:41:42,920
But you see you know there are some negatives there so just you know be wary about that.

466
00:41:42,920 --> 00:41:45,680
You know if you're summing up costs then there's a negative.

467
00:41:45,680 --> 00:41:49,560
How do you handle that?
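
One possible way to handle it, sketched under the assumption that negative or missing costs are recording errors to be flagged rather than summed. Whether that assumption holds for a given licensee is a judgment call; the values here are made up.

```python
# Hypothetical cost values for one licensee; None marks a missing cost.
costs = [4.50, -2.00, 3.25, None, 6.00]

# Separate plausible costs from ones that need a closer look.
valid = [c for c in costs if c is not None and c >= 0]
flagged = [c for c in costs if c is None or c < 0]

total_all = sum(c for c in costs if c is not None)  # naive sum, negatives included
total_valid = sum(valid)                            # sum after excluding flagged values

print(f"Naive total: {total_all:.2f}")
print(f"Total excluding flagged values: {total_valid:.2f}")
print(f"Flagged {len(flagged)} of {len(costs)} observations")
```

Reporting both totals side by side shows how much the questionable records move the statistic.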

468
00:41:49,560 --> 00:41:57,820
And you'll see a similar thing with discounts where the discount field is also a really

469
00:41:57,820 --> 00:41:59,320
messy field.

470
00:41:59,320 --> 00:42:03,600
But there could be insights there.

471
00:42:03,600 --> 00:42:10,760
You know like what like you know there could be there could be really critical insights

472
00:42:10,760 --> 00:42:12,040
there.

473
00:42:12,040 --> 00:42:20,600
I mean that's what economists are always studying is price and they're talking about real price.

474
00:42:20,600 --> 00:42:23,920
You know what's the real price the consumer pays.

475
00:42:23,920 --> 00:42:29,280
So you're doing cost studies price studies.

476
00:42:29,280 --> 00:42:33,960
You definitely want to have a nice accurate measure of the discounts.

477
00:42:33,960 --> 00:42:41,080
As I said I haven't been able to map back the lab results yet.

478
00:42:41,080 --> 00:42:45,960
And this is something that of course everybody's super interested in.

479
00:42:45,960 --> 00:42:54,640
And the one question that has been on everybody's mind and we've been asking this question now

480
00:42:54,640 --> 00:43:01,800
for over two years is how much does THC matter.

481
00:43:01,800 --> 00:43:07,960
First, I mean, the first question is: does THC matter?

482
00:43:07,960 --> 00:43:10,100
Just a binary yes or no.

483
00:43:10,100 --> 00:43:15,040
You can basically just do and we've said all of these those two questions.

484
00:43:15,040 --> 00:43:17,960
Does does THC matter.

485
00:43:17,960 --> 00:43:28,960
And if so how much does THC matter can be run can be estimated by just a simple regression

486
00:43:28,960 --> 00:43:37,560
basically of price on.

487
00:43:37,560 --> 00:43:42,760
And total THC look they're even right beside each other.

488
00:43:42,760 --> 00:43:44,600
So that that's all you need.
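
As a rough sketch, the simple regression of price on total THC could look like the following once the lab results are matched. The numbers are invented toy values, since the matched data is not available yet, so the estimate below says nothing about the real market.

```python
from statistics import mean

# Invented toy data: total THC (percent) and unit price (dollars).
thc = [15.0, 20.0, 25.0, 30.0]
price = [8.0, 9.0, 10.0, 11.0]


def ols(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = ols(thc, price)
print(f"price = {intercept:.2f} + {slope:.2f} * total_thc")
```

A slope near zero would suggest THC does not matter for price; the size of the slope answers how much it matters. A real analysis would also want standard errors and controls such as product type.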

489
00:43:44,600 --> 00:43:49,600
But unfortunately you know the data should be there.

490
00:43:49,600 --> 00:43:53,600
But I tried pretty extensively.

491
00:43:53,600 --> 00:43:59,000
And so like I said so this is where you could potentially have a breakthrough because if

492
00:43:59,000 --> 00:44:05,960
you can match the lab results to which which we have.

493
00:44:05,960 --> 00:44:06,960
Right.

494
00:44:06,960 --> 00:44:13,840
So you have these nice curated lab results.

495
00:44:13,840 --> 00:44:15,520
So I'll show you the two.

496
00:44:15,520 --> 00:44:21,400
So here's lab results that haven't been matched to inventory.

497
00:44:21,400 --> 00:44:28,080
And then these looks like there's some problems.

498
00:44:28,080 --> 00:44:32,760
And then here are some lab results that have been matched to inventory.

499
00:44:32,760 --> 00:44:37,000
So I'll show you those two.

500
00:44:37,000 --> 00:44:44,800
So I think what's causing this error is I think one of the fields somewhere in here

501
00:44:44,800 --> 00:44:52,360
right, some product description or strain name, I think, begins with an equal sign, and Excel

502
00:44:52,360 --> 00:44:54,340
doesn't like them.

503
00:44:54,340 --> 00:45:00,480
So once again if you can find you know what strain or product name that is or implement

504
00:45:00,480 --> 00:45:04,320
a fix that's a necessary fix.
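
A possible fix, sketched under the assumption that the error really does come from text values starting with an equal sign. The helper name and sample values are hypothetical, and whether the apostrophe ends up visible in the cell depends on how the spreadsheet file is written.

```python
def escape_leading_equals(value):
    """Prefix an apostrophe so a spreadsheet reads '=...' as text, not a formula."""
    if isinstance(value, str) and value.lstrip().startswith("="):
        return "'" + value
    return value


# Hypothetical product and strain names, including one offending value.
names = ["Lemon Cherry Gelato", "=Blue Dream", 4.5, None]

# Find the offending values, then build a cleaned copy.
offenders = [n for n in names if escape_leading_equals(n) != n]
cleaned = [escape_leading_equals(n) for n in names]

print("Offending values:", offenders)
print("Cleaned values:", cleaned)
```

Listing the offenders first answers the question of which strain or product name is causing the problem before anything is changed.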

505
00:45:04,320 --> 00:45:12,560
And but but but but but long story short the this is basically if you just want the raw

506
00:45:12,560 --> 00:45:16,140
lab results we have those.

507
00:45:16,140 --> 00:45:23,240
And so as I said they've got an inventory ID and then you can get all the cool things

508
00:45:23,240 --> 00:45:27,960
about these particular samples right there.

509
00:45:27,960 --> 00:45:44,040
The delta-9 THC, CBD, the moisture content; once again, a completely under-analyzed field

510
00:45:44,040 --> 00:45:49,720
here when from my perspective I don't think the labs.

511
00:45:49,720 --> 00:45:53,320
Hey well maybe things have kind of changed.

512
00:45:53,320 --> 00:45:55,720
Actually it's different in different states.

513
00:45:55,720 --> 00:46:04,440
Some states people report dry weight where they basically it's actually an inflation

514
00:46:04,440 --> 00:46:09,960
so they actually inflate the number kind of relative to the moisture content.

515
00:46:09,960 --> 00:46:13,000
So in some states it really really matters.

516
00:46:13,000 --> 00:46:22,880
Washington State I'm fairly certain reports wet weight numbers and maybe somebody trying

517
00:46:22,880 --> 00:46:25,160
to join real quick.

518
00:46:25,160 --> 00:46:30,200
Hopefully hopefully some psychosis.

519
00:46:30,200 --> 00:46:44,560
Long story short there may or may not be much that you can actually get like knowledge wise

520
00:46:44,560 --> 00:46:47,080
out of moisture and water activity.

521
00:46:47,080 --> 00:46:51,200
But any data is better than no data.

522
00:46:51,200 --> 00:46:58,560
And as I pointed out hopefully to people's attention is you know it's worthwhile looking

523
00:46:58,560 --> 00:47:02,480
at pesticides and residual solvents.

524
00:47:02,480 --> 00:47:08,400
There's actually almost I mean there may be some but I want to say there's either none

525
00:47:08,400 --> 00:47:13,320
or almost no heavy metal detections.

526
00:47:13,320 --> 00:47:17,680
Let's actually confirm real quick.

527
00:47:17,680 --> 00:47:25,680
So let's just look at heavy metals and I want to say there's actually I take that back.

528
00:47:25,680 --> 00:47:27,960
It looks like.

529
00:47:27,960 --> 00:47:30,160
Right.

530
00:47:30,160 --> 00:47:41,560
So out of what is there out of like seventy five thousand lab tests there were about one

531
00:47:41,560 --> 00:47:45,960
hundred that that heavy metal was detected in.

532
00:47:45,960 --> 00:47:49,560
So that's that's pretty serious.

533
00:47:49,560 --> 00:48:02,280
You know once again these are outliers but sometimes it's interesting to study the outliers

534
00:48:02,280 --> 00:48:03,280
right.

535
00:48:03,280 --> 00:48:05,060
And there's only so much time in the day.

536
00:48:05,060 --> 00:48:10,720
So maybe this is a cool pet project for one of you data scientists is you know why don't

537
00:48:10,720 --> 00:48:13,840
you look at.

538
00:48:13,840 --> 00:48:17,480
So here we could actually do the same thing and we can get more information so we could

539
00:48:17,480 --> 00:48:23,760
actually see you know is there any sort of you know pattern right.

540
00:48:23,760 --> 00:48:27,080
That's what humans are all about right is looking for patterns.

541
00:48:27,080 --> 00:48:33,940
So is there any pattern to to the samples that failed for heavy metals.

542
00:48:33,940 --> 00:48:40,720
So once again I wouldn't be surprised if there if there's some interesting.

543
00:48:40,720 --> 00:48:53,320
Looks like they're yeah we are thinking that they may be some sort of concentrates because.

544
00:48:53,320 --> 00:48:57,200
You can get sort of all sorts of background contamination right.

545
00:48:57,200 --> 00:49:03,720
So if you're like like say you're making edibles in your you know you're grinding it up or

546
00:49:03,720 --> 00:49:08,880
something like maybe you're using a metal grinder or something like that and you're

547
00:49:08,880 --> 00:49:11,160
getting but they're getting hits.

548
00:49:11,160 --> 00:49:14,280
So maybe who knows right.

549
00:49:14,280 --> 00:49:19,800
So I don't want to actually make any I don't want to make any I guess they'd be hypotheses

550
00:49:19,800 --> 00:49:23,440
but I think at this stage they're more of conjectures.

551
00:49:23,440 --> 00:49:28,480
So I don't really want to just take any stabs at this but.

552
00:49:28,480 --> 00:49:33,040
But I'm just kind of pointing this out is there's really cool.

553
00:49:33,040 --> 00:49:43,480
Data still to be had by adding the lab results here.

554
00:49:43,480 --> 00:49:47,200
Final thing is.

555
00:49:47,200 --> 00:49:50,440
You know what what's the point of all of this.

556
00:49:50,440 --> 00:50:00,120
Well at the end of the day the point is calculating statistics and once again.

557
00:50:00,120 --> 00:50:06,120
Ignore, or neglect, or

558
00:50:06,120 --> 00:50:10,880
Don't take into consideration the statistics I'm about to show you because as I pointed

559
00:50:10,880 --> 00:50:17,880
out I don't think I've matched all of the licensees to sales.

560
00:50:17,880 --> 00:50:24,400
So this is only the total sales for the people I have identified.

561
00:50:24,400 --> 00:50:30,800
And so this is going to explain why the numbers look really really low.

562
00:50:30,800 --> 00:50:34,480
But.

563
00:50:34,480 --> 00:50:40,760
But we can find out the you know we can start to basically these are basically.

564
00:50:40,760 --> 00:50:47,600
So I may have oversold the statistics that I was telling you about today but as I said.

565
00:50:47,600 --> 00:50:48,600
Here's the code.

566
00:50:48,600 --> 00:50:55,720
I'll have this posted to GitHub today for you to peruse through and see if there's any

567
00:50:55,720 --> 00:50:58,840
improvements we can have here.

568
00:50:58,840 --> 00:51:03,600
But the whole point is.

569
00:51:03,600 --> 00:51:10,440
Can't we now take this a step further than just look at aggregate sales because I mean

570
00:51:10,440 --> 00:51:15,960
how meaningful is that like say there's a hundred million.

571
00:51:15,960 --> 00:51:20,560
Per month or 50 million per month.

572
00:51:20,560 --> 00:51:23,520
Different states it's in it's in that ballpark.

573
00:51:23,520 --> 00:51:28,160
I mean what what does that even mean you know and that's where we've started to say like

574
00:51:28,160 --> 00:51:34,360
oh well maybe sales per retailer is more relevant.

575
00:51:34,360 --> 00:51:41,960
But then we ran into the distribution problem where you know not all retailers are

576
00:51:41,960 --> 00:51:50,960
are operating the same right there and they're not all located in the same place.

577
00:51:50,960 --> 00:51:57,080
The classic Hotelling problem so you know the long story short you can start doing fun things

578
00:51:57,080 --> 00:52:00,920
like you know.

579
00:52:00,920 --> 00:52:06,680
Calculating how many products are sold you can look at how many products are sold by

580
00:52:06,680 --> 00:52:09,880
retailer and here's a hint.

581
00:52:09,880 --> 00:52:20,320
In the past we we did run a regression where we did find a positive correlation between

582
00:52:20,320 --> 00:52:27,640
the amount of products, I want to say sold, but it could have been stocked, but I want

583
00:52:27,640 --> 00:52:28,640
to say sold.

584
00:52:28,640 --> 00:52:33,200
So just the total amount of different products you were selling and your revenue.

585
00:52:33,200 --> 00:52:39,360
And this kind of goes in line with something someone told me once about a retailer and

586
00:52:39,360 --> 00:52:45,080
he says you just want to have your product your shelves filled with a bunch of different

587
00:52:45,080 --> 00:52:48,800
types of products and you want to keep cycling those.

588
00:52:48,800 --> 00:52:55,280
And once again this is just hearsay, this is just what this person recommends.

589
00:52:55,280 --> 00:53:01,000
But he says you know you want to keep your customers seeing new products you know they're

590
00:53:01,000 --> 00:53:04,240
going to get bored if they only see the same things on the shelves.

591
00:53:04,240 --> 00:53:11,760
Once again he may be right he may be wrong but it's kind of fun right cannabis.

592
00:53:11,760 --> 00:53:17,400
They're always coming up with new strain names so maybe that maybe people maybe they find

593
00:53:17,400 --> 00:53:24,520
that fun maybe they're looking for the new fun strain name this this month.

594
00:53:24,520 --> 00:53:29,440
And but but this is basically where I.

595
00:53:29,440 --> 00:53:38,760
I started to realize my numbers were completely wrong, because if you only count

596
00:53:38,760 --> 00:53:49,360
everything that can be matched back to inventory I can only account for about less than 24

597
00:53:49,360 --> 00:53:52,200
million in sales.

598
00:53:52,200 --> 00:54:00,800
But I want to say once again we've calculated this in the past and so I'm just going to

599
00:54:00,800 --> 00:54:07,200
have to go off of my Bayesian prior aka my best recollection.

600
00:54:07,200 --> 00:54:14,080
But my best recollection is maybe say like I said I think sales in Washington maybe around

601
00:54:14,080 --> 00:54:17,040
50 to 100 million a month.

602
00:54:17,040 --> 00:54:23,080
So I think we've only accounted for a small fraction not a small but definitely only a

603
00:54:23,080 --> 00:54:26,320
fraction of sales.

604
00:54:26,320 --> 00:54:29,560
But but you know what are we after here.

605
00:54:29,560 --> 00:54:36,880
You know we can start doing things like oh you know we can try to calculate the effective

606
00:54:36,880 --> 00:54:37,960
tax rate.

607
00:54:37,960 --> 00:54:38,960
So you know what.

608
00:54:38,960 --> 00:54:45,600
After you add up all the taxes, you know, how much are people paying?

609
00:54:45,600 --> 00:54:52,720
And here is actually a statistic that could maybe be extrapolated out.

610
00:54:52,720 --> 00:55:00,080
But once again we've got a non-random sample out of our population of sales. Our non-random

611
00:55:00,080 --> 00:55:06,080
sample is the sales items I've been able to curate.

612
00:55:06,080 --> 00:55:10,680
But this is a statistic that I found interesting.

613
00:55:10,680 --> 00:55:11,680
Right.

614
00:55:11,680 --> 00:55:17,400
So we're just going to look at the proportion of products sold.
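
The proportion calculation can be sketched as below. The product types and counts are hypothetical, chosen only to show the arithmetic, and are not the figures from the curated data.

```python
from collections import Counter

# Hypothetical product type for each sold item.
product_types = (
    ["usable cannabis"] * 5
    + ["concentrate"] * 2
    + ["liquid edible"]
    + ["solid edible"]
    + ["other"]
)

# Count items per type, then convert counts to shares of all items sold.
counts = Counter(product_types)
total = sum(counts.values())
shares = {product_type: n / total for product_type, n in counts.items()}

for product_type, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{product_type}: {share:.0%}")
```

Because the sample is non-random, shares computed this way describe only the matched items and may shift once the remaining sales are included.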

615
00:55:17,400 --> 00:55:25,240
I mean so this this statistic may or may not change as we we take into consideration all

616
00:55:25,240 --> 00:55:30,120
of the other sales items.

617
00:55:30,120 --> 00:55:40,800
But long story short is, you know, my Bayesian prior was that flower was around 60 percent

618
00:55:40,800 --> 00:55:43,280
of all sales.

619
00:55:43,280 --> 00:55:51,600
But it's looking like either things have changed in the Washington market.

620
00:55:51,600 --> 00:56:00,520
These statistics are incorrect or potentially people are purchasing more of some of

621
00:56:00,520 --> 00:56:06,720
these other types of products like the liquid edibles and solid edibles.

622
00:56:06,720 --> 00:56:13,000
We'll have to double check but you know those are each approaching.

623
00:56:13,000 --> 00:56:17,240
They're still shy each of about 10 percent of sales.

624
00:56:17,240 --> 00:56:21,680
But that may be up.

625
00:56:21,680 --> 00:56:22,680
That may be up.

626
00:56:22,680 --> 00:56:29,800
I want to say in the past maybe liquid and solid edibles combined were around 10 percent.

627
00:56:29,800 --> 00:56:38,920
And so you know if it increases from around 10 percent to around 15 percent of total sales

628
00:56:38,920 --> 00:56:42,840
those sales have to come from somewhere.

629
00:56:42,840 --> 00:56:45,800
So who knows.

630
00:56:45,800 --> 00:56:50,200
Maybe people are doing some sort of substitution there.

631
00:56:50,200 --> 00:57:03,360
But anywho I've kind of lost my confidence in my own statistics for better or for worse.

632
00:57:03,360 --> 00:57:08,360
So make of that what you will.

633
00:57:08,360 --> 00:57:11,800
But I don't know what what do you make of it.

634
00:57:11,800 --> 00:57:19,720
Do you do you think we need to to refine some of this data curation a little bit.

635
00:57:19,720 --> 00:57:26,000
Use some of the other low hanging fruit or you know work on this cherry picker.

636
00:57:26,000 --> 00:57:29,600
You know what are your thoughts.

637
00:57:29,600 --> 00:57:31,480
Please.

638
00:57:31,480 --> 00:57:37,840
Data cleaning is is great.

639
00:57:37,840 --> 00:57:44,320
And then also, too, you know, if we started putting in tests, little tests, you know, to make

640
00:57:44,320 --> 00:57:49,760
sure you know if there are empty values you know check on that.
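
A little test of that kind might look like the sketch below, which counts empty values per field. The field names and rows are assumptions for illustration.

```python
def missing_counts(rows, fields):
    """Count rows where each field is absent, None, or an empty string."""
    return {
        field: sum(1 for row in rows if row.get(field) in (None, ""))
        for field in fields
    }


# Hypothetical augmented sale items with some gaps.
rows = [
    {"strain_name": "Lemon Cherry Gelato", "total_thc": None},
    {"strain_name": "", "total_thc": 21.5},
    {"strain_name": None, "total_thc": None},
]

missing = missing_counts(rows, ["strain_name", "total_thc"])
for field, n in missing.items():
    print(f"{field}: {n} of {len(rows)} rows are empty")
```

Running a check like this after each curation step would show where in the pipeline the matches start to drop off.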

641
00:57:49,760 --> 00:57:52,440
And I don't know.

642
00:57:52,440 --> 00:57:55,640
Awesome job Keegan.

643
00:57:55,640 --> 00:57:58,920
Well it will be awesome before long.

644
00:57:58,920 --> 00:58:05,240
So as I said if we are somehow able to match everything up.

645
00:58:05,240 --> 00:58:12,920
So basically if we're able to match every single sales item to a piece of inventory

646
00:58:12,920 --> 00:58:20,120
and then we should be able to match every thing that was sold to a lab result via the

647
00:58:20,120 --> 00:58:27,680
inventory then we'll have a complete picture of the Washington state market.

648
00:58:27,680 --> 00:58:35,280
And this is kind of what we've been chasing and I feel like we're so so so close.

649
00:58:35,280 --> 00:58:42,160
And I think this would really just I think this would finally be like the big breakthrough

650
00:58:42,160 --> 00:58:48,600
everybody's been looking for because you could finally start to answer some of these really

651
00:58:48,600 --> 00:58:52,440
interesting questions.

652
00:58:52,440 --> 00:58:57,800
For example, you know, just to play devil's advocate.

653
00:58:57,800 --> 00:59:02,600
People are saying oh there's a problem with high THC cannabis.

654
00:59:02,600 --> 00:59:08,840
Well you could then look at the zip codes where there's a lot of high THC cannabis being

655
00:59:08,840 --> 00:59:12,640
sold you know see if that correlates with anything negative.

656
00:59:12,640 --> 00:59:14,960
So that's one thing you could do.

657
00:59:14,960 --> 00:59:22,560
As I said I personally would just love to see how much THC matters.

658
00:59:22,560 --> 00:59:24,360
If it even matters at all.

659
00:59:24,360 --> 00:59:25,360
Right.

660
00:59:25,360 --> 00:59:29,320
For example for edibles it may not matter at all.

661
00:59:29,320 --> 00:59:32,440
Or it may.

662
00:59:32,440 --> 00:59:38,440
So I'm you.

663
00:59:38,440 --> 00:59:39,440
I don't know.

664
00:59:39,440 --> 00:59:44,080
I just feel like I just feel like the sky's the limit.

665
00:59:44,080 --> 00:59:48,200
Once as Candice points out the data is you know finally clean.

666
00:59:48,200 --> 00:59:55,840
So I don't think it's I don't think it's quite at that stage but it's I almost am calling

667
00:59:55,840 --> 00:59:59,760
on you for help because it's almost as far as I can take it.

668
00:59:59,760 --> 01:00:06,160
So I really need one of you to have like almost a brilliant insight here in you

669
01:00:06,160 --> 01:00:12,640
know how can we map like match this inventory together.

670
01:00:12,640 --> 01:00:19,400
So I strongly encourage you to help on this because as I said I've got a sneaking suspicion

671
01:00:19,400 --> 01:00:28,880
that there's either like some sort of like parent child inventory relationship or maybe

672
01:00:28,880 --> 01:00:32,240
the product data is the missing piece.

673
01:00:32,240 --> 01:00:38,000
But there's just I feel like there's just this one final piece that'll bring everything

674
01:00:38,000 --> 01:00:41,560
together.

675
01:00:41,560 --> 01:00:43,900
So I'm sharing this with you.

676
01:00:43,900 --> 01:00:49,480
So I hope you're excited about this and the potential.

677
01:00:49,480 --> 01:00:51,080
That's awesome Keegan.

678
01:00:51,080 --> 01:00:58,720
So for everybody, too: Cannlytics on GitHub, there is a cannabis data science repository that you

679
01:00:58,720 --> 01:01:05,840
can, you know, clone, and also, too, there is Cannlytics dot com.

680
01:01:05,840 --> 01:01:11,320
There's a Slack channel, and because that's what I'm gathering right Keegan that you know

681
01:01:11,320 --> 01:01:17,080
we're looking for other eyeballs and you know certainly to you know Keegan if you're busy

682
01:01:17,080 --> 01:01:22,840
to when people need some help with setting up environments or you know running code to

683
01:01:22,840 --> 01:01:28,680
Cannlytics, you know, if you want you can toss it over to me and if anybody appreciates my

684
01:01:28,680 --> 01:01:35,040
help you don't have to pay me just donate to cannabis data science and Cannlytics.

685
01:01:35,040 --> 01:01:36,780
So it's so good.

686
01:01:36,780 --> 01:01:41,560
Thank you Keegan.

687
01:01:41,560 --> 01:01:43,080
I absolutely love it.

688
01:01:43,080 --> 01:01:50,240
Thanks for the shout out Candice and I'm a little behind on committing the code but today

689
01:01:50,240 --> 01:01:54,120
is the day right.

690
01:01:54,120 --> 01:01:58,920
You know the Python philosophy, you know, now is better than never, and you know today is

691
01:01:58,920 --> 01:02:00,300
finally the now.

692
01:02:00,300 --> 01:02:03,800
So I'm going to get all the code posted today.

693
01:02:03,800 --> 01:02:10,660
As I mentioned it's going to be messy and there's an invitation to the Slack channel

694
01:02:10,660 --> 01:02:15,440
so you're welcome to join us there and then as I said I'm going to be doing a better job

695
01:02:15,440 --> 01:02:23,600
at, you know, filling out issues on GitHub, making sure it's a bit more, you know, generally approachable,

696
01:02:23,600 --> 01:02:29,560
because there's some real cool data curation scripts here because the idea is you can take

697
01:02:29,560 --> 01:02:36,160
that, say, public records request from the Washington State LCB, and then you can sort of run these

698
01:02:36,160 --> 01:02:38,800
curation scripts one by one.

699
01:02:38,800 --> 01:02:46,920
So the lab result one, after a lot of fine tuning, it used to take two hours to run and

700
01:02:46,920 --> 01:02:49,960
we now have it down to eight minutes.

701
01:02:49,960 --> 01:02:57,560
So you can now curate that data set of lab results I was showing you in about eight minutes

702
01:02:57,560 --> 01:03:03,200
and there's a lot, a lot you can poke around at there.

703
01:03:03,200 --> 01:03:06,600
Like I was just showing you with the heavy metals.

704
01:03:06,600 --> 01:03:12,120
I mean that's a whole research question of its own that people are super interested in.

705
01:03:12,120 --> 01:03:18,520
I mean, different states are requiring heavy metal testing, and, I mean, a big question

706
01:03:18,520 --> 01:03:21,220
is, you know, is it worth it, right?

707
01:03:21,220 --> 01:03:28,440
It's a really, really expensive test, and so if all of the detections are just

708
01:03:28,440 --> 01:03:36,520
oddballs, then is it worth it? Or maybe they snagged some really hard finds.

709
01:03:36,520 --> 01:03:41,280
So that's a really interesting research question.

710
01:03:41,280 --> 01:03:46,880
And then the sales items it takes about two hours.

711
01:03:46,880 --> 01:03:52,440
Once again, maybe on a nice rig like Candice's you may make short work

712
01:03:52,440 --> 01:04:05,000
of this but for me it takes about two to four hours per sales data file and there are I

713
01:04:05,000 --> 01:04:09,800
have a GPU, so, you know, let me know, too, if I get kind of a little lost in my LangChain

714
01:04:09,800 --> 01:04:16,480
and trying out the different free, open source LLMs, you know, help bring me back down to earth

715
01:04:16,480 --> 01:04:21,920
and just put in what you want me to run and download, and,

716
01:04:21,920 --> 01:04:25,800
you know, I should be able to get you results within, hopefully, a

717
01:04:25,800 --> 01:04:29,920
quick, you know, GPU amount of time.

718
01:04:29,920 --> 01:04:38,800
It's phenomenal, and we're kind of going under this sort of blockchain philosophy here,

719
01:04:38,800 --> 01:04:46,360
in that we could ideally distribute this in that if we all follow the same curation principles

720
01:04:46,360 --> 01:04:50,400
you can almost, say, check if one was curated the same way.

721
01:04:50,400 --> 01:04:55,920
So we basically just create a hash of the whole data set and if we each get the same

722
01:04:55,920 --> 01:04:58,580
hash then we know it's the same.
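[Editor's sketch: the hash check described here could look roughly like the following. It assumes the curated data can be loaded as a list of records; `dataset_hash` and the field names are hypothetical and are not taken from the Cannlytics code.]
```python
import hashlib
import json
def dataset_hash(records):
    """Return a SHA-256 hex digest for a list of dict records (hypothetical helper)."""
    # Serialize each record with sorted keys so column order does not matter,
    # then sort the serialized rows so row order does not matter either.
    lines = sorted(json.dumps(r, sort_keys=True, default=str) for r in records)
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
# Two curators who processed the same file should end up with matching digests.
mine = [{"sale_id": 1, "total": 10.5}, {"sale_id": 2, "total": 7.0}]
theirs = [{"total": 7.0, "sale_id": 2}, {"total": 10.5, "sale_id": 1}]
print(dataset_hash(mine) == dataset_hash(theirs))
```
[Sorting keys and rows before hashing is one design choice among several: it makes the digest depend only on the content, so two machines that write rows in a different order can still compare results.]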

723
01:04:58,580 --> 01:05:08,680
So maybe I curate 101 and 102, and then Candice calculates, you know, 80 through 101, and then,

724
01:05:08,680 --> 01:05:15,320
you know, I can check that 101 was right, so, yes, you know, you can kind of verify that

725
01:05:15,320 --> 01:05:22,640
this process was right, and then, yeah, you know, Candice can do a lot of the work.

726
01:05:22,640 --> 01:05:33,320
So I just had my computer running for the past week or two so like I said it took you

727
01:05:33,320 --> 01:05:38,960
know, 200 to 400 hours of computing, but actually I was running it sometimes, like,

728
01:05:38,960 --> 01:05:50,200
in five parallel processes, so maybe 50 to 60 hours, with some breaks there to let

729
01:05:50,200 --> 01:05:52,120
my computer cool down.

730
01:05:52,120 --> 01:06:00,480
So it takes a long time to curate the data, and I made a mistake, so we're going to have

731
01:06:00,480 --> 01:06:06,280
to fix the mistake and then, you know, recreate all of them, but that's okay. Like I said, it

732
01:06:06,280 --> 01:06:12,320
was a good learning process to just see how much of the data we actually can match, how

733
01:06:12,320 --> 01:06:20,480
much is still to be matched, and then, you know, I'll get all this in your hands as much as

734
01:06:20,480 --> 01:06:29,160
I can, and, yeah, all hands on deck. You know, put your favorite data science tools to use

735
01:06:29,160 --> 01:06:34,040
and let's uncover some knowledge.

736
01:06:34,040 --> 01:06:41,760
So I want to go ahead and thank you all for coming. I can't thank you enough. It's your

737
01:06:41,760 --> 01:06:47,400
eyes, your ears, your attention that's really moving things forward. You're the people who

738
01:06:47,400 --> 01:06:50,760
are making the Cannabis Data Science Meetup happen.

739
01:06:50,760 --> 01:06:57,880
I hope we're advancing the cannabis space and cannabis science, if even only a molecule

740
01:06:57,880 --> 01:07:18,160
at a time. I'm happy with that, so I hope you're happy and had a little fun, too.

