1
00:00:00,000 --> 00:00:17,000
Good morning, Charles and Brooke and everyone else. So thank you for joining the Cannabis Data Science Meetup group.

2
00:00:17,000 --> 00:00:24,000
My name is Keegan, the founder of Cannlytics, a company to make cannabis analysis simple and easy.

3
00:00:24,000 --> 00:00:38,000
Today we will be looking at lab results in Washington State, and Charles has prepared some work trying to predict if a sample may fail quality assurance testing.

4
00:00:38,000 --> 00:00:48,000
And then I can follow up with some analysis on cannabinoids afterwards. So without further ado, Charles, would you like to present some of your work?

5
00:00:48,000 --> 00:00:58,000
Sure. Let's see, I've actually never presented.

6
00:00:58,000 --> 00:01:03,000
Where is that?

7
00:01:03,000 --> 00:01:05,000
And can I even do it?

8
00:01:05,000 --> 00:01:11,000
So it should look like a box with an up arrow.

9
00:01:11,000 --> 00:01:21,000
Oh, that one. Okay.

10
00:01:21,000 --> 00:01:23,000
Cool.

11
00:01:23,000 --> 00:01:51,000
Okay, so previously I had done some work with trying to predict sample failures in the lab. Most of the stuff at the beginning is just sort of reading in the files and cleaning them up.

12
00:01:51,000 --> 00:01:59,000
And then encoding them and splitting the data into training and testing sets.

13
00:01:59,000 --> 00:02:05,000
And then because there's a huge class imbalance.

14
00:02:05,000 --> 00:02:12,000
There's way more passing samples than there are failing samples.

15
00:02:12,000 --> 00:02:25,000
And so I try to manipulate these class weights to help the model compensate for the lack of failing data.
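The class-weight manipulation described above can be sketched in a few lines; the 880/120 pass/fail split below is made up, and this is just the standard "balanced" weighting (which CatBoost can take via its `class_weights` parameter):

```python
from collections import Counter

# Hypothetical labels: 1 = fail, 0 = pass, with a heavy class imbalance.
labels = [0] * 880 + [1] * 120

# "Balanced" class weights: weight each class inversely to its frequency,
# so the minority (failing) class counts for more during training.
counts = Counter(labels)
n, k = len(labels), len(counts)
class_weights = {cls: n / (k * cnt) for cls, cnt in counts.items()}
```

Here the failing class gets a weight of about 4.2 versus 0.57 for the passing class, which is the compensation being described.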

16
00:02:25,000 --> 00:02:50,000
And all the data is categorical, right? It's not numeric. So in order to use CatBoost, you tell it which columns are categorical. And then CatBoost is really good for dealing with categorical data, and it tends to be a fairly good classifier.

17
00:02:50,000 --> 00:03:10,000
So you train the model and what I had done is I had taken this data and used Optuna to come up with an optimal set of parameters for this particular data with CatBoost.

18
00:03:10,000 --> 00:03:34,000
And then Optuna is like a genetic algorithm: it'll try a set of parameters, and then it'll actually come up with a population of parameters, try those, take the best ones, and then use those to come up with new parameters.
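Optuna's default sampler is actually TPE rather than a genetic algorithm, but the population-based loop described here (try a population, keep the best, generate new parameters) can be illustrated in plain Python; the objective and parameter ranges below are toy stand-ins for training CatBoost and returning a validation loss:

```python
import random

random.seed(0)

# Toy objective standing in for "train the model, return validation loss".
def objective(params):
    return (params["depth"] - 6) ** 2 + (params["lr"] - 0.1) ** 2

def random_params():
    return {"depth": random.randint(2, 10), "lr": random.uniform(0.01, 0.3)}

# Minimal evolutionary loop: evaluate a population, keep the best,
# and mutate the survivors to propose the next generation.
population = [random_params() for _ in range(20)]
for _ in range(10):
    population.sort(key=objective)
    survivors = population[:5]
    children = [
        {"depth": max(2, min(10, p["depth"] + random.choice([-1, 0, 1]))),
         "lr": min(0.3, max(0.01, p["lr"] + random.uniform(-0.02, 0.02)))}
        for p in survivors for _ in range(3)
    ]
    population = survivors + children

best = min(population, key=objective)
```

The day-plus runtime mentioned comes from each objective call being a full model training run, not from the search loop itself.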

19
00:03:34,000 --> 00:03:41,000
And this takes about a day, day and a half to run.

20
00:03:41,000 --> 00:03:48,000
So, last time, a big part of the problem was that I didn't really have a goal.

21
00:03:48,000 --> 00:03:56,000
So Keegan had pointed out that the big goal was to not miss any failing samples.

22
00:03:56,000 --> 00:04:00,000
Because last time, I was kind of trying to optimize everything.

23
00:04:00,000 --> 00:04:07,000
So I came up with this particular model and it only misses one of the failing samples.

24
00:04:07,000 --> 00:04:13,000
So, right, it pretty much meets your goal.

25
00:04:13,000 --> 00:04:23,000
But, oh, and also CatBoost will tell you how important each one of the features was.

26
00:04:23,000 --> 00:04:30,000
So the lab ID turned out to be the most important for some reason.

27
00:04:30,000 --> 00:04:40,000
I mean, maybe particular, either a certain lab or certain labs are more strict or certain labs just tend to get more samples that fail.

28
00:04:40,000 --> 00:04:42,000
You know, we don't really know.

29
00:04:42,000 --> 00:04:46,000
But this, the whole thing bothered me.

30
00:04:46,000 --> 00:04:48,000
This is kind of really odd.

31
00:04:48,000 --> 00:04:56,000
I mean, there's obviously a lot of false positives, you know, thousands.

32
00:04:56,000 --> 00:04:59,000
So that's a lot.

33
00:04:59,000 --> 00:05:04,000
So I was wondering, is this thing really learning anything?

34
00:05:04,000 --> 00:05:08,000
And the answer is no.

35
00:05:08,000 --> 00:05:11,000
It's basically a dummy classifier.

36
00:05:11,000 --> 00:05:19,000
So what it does is it predicts that marijuana fails 99 percent of the time.

37
00:05:19,000 --> 00:05:26,000
When in reality, it only fails 12 percent of the time.
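The dummy-classifier tell described here, a predicted failure rate wildly above the actual one for a product type, can be checked with a simple groupby; all numbers and column names below are illustrative, chosen to mirror the 99-percent-predicted versus 12-percent-actual gap:

```python
import pandas as pd

# Toy results frame (hypothetical columns): one row per sample, with the
# model's prediction and the actual outcome (1 = fail).
df = pd.DataFrame({
    "type": ["marijuana"] * 100 + ["end_product"] * 300,
    "predicted_fail": [1] * 99 + [0] * 1 + [0] * 296 + [1] * 4,
    "actual_fail":    [1] * 12 + [0] * 88 + [0] * 297 + [1] * 3,
})

# Predicted vs. actual failure rate per product type; a large gap is the
# sign the model has just memorized "this type fails".
rates = df.groupby("type")[["predicted_fail", "actual_fail"]].mean()
```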

38
00:05:26,000 --> 00:05:35,000
That was the sneaking suspicion I started to get last time was maybe there's a certain product that tends to fail more than others.

39
00:05:35,000 --> 00:05:47,000
And like you said, it's basically just pinpointed that sample type, and it's basically saying that that sample type fails every time.

40
00:05:47,000 --> 00:05:53,000
Yeah, basically. Yeah, 99.5 percent of the time.

41
00:05:53,000 --> 00:06:03,000
Well, that's what the classifier thinks or predicts that marijuana fails almost all the time.

42
00:06:03,000 --> 00:06:06,000
But yeah, it does.

43
00:06:06,000 --> 00:06:11,000
It does fail more than the other product types.

44
00:06:11,000 --> 00:06:15,000
But in reality, it's only 12 percent.

45
00:06:15,000 --> 00:06:26,000
And is this mixed marijuana by chance, or is the type just labeled as marijuana?

46
00:06:26,000 --> 00:06:32,000
I don't know, is it the intermediate type by chance?

47
00:06:32,000 --> 00:06:37,000
No, that's the actual product type.

48
00:06:37,000 --> 00:06:42,000
OK, we'll dive into some of this data here in a second.

49
00:06:42,000 --> 00:06:50,000
So from my understanding, the intermediate type is really.

50
00:06:50,000 --> 00:06:54,000
It is kind of the main classifier.

51
00:06:54,000 --> 00:07:05,000
So. Like, so, for example, here, just like the type, right. So everything's going to be categorized essentially as an end product.

52
00:07:05,000 --> 00:07:10,000
Well, everything that makes it to the store shelves.

53
00:07:10,000 --> 00:07:15,000
And so I'm curious about the product type.

54
00:07:15,000 --> 00:07:21,000
So for marijuana, the main product type or.

55
00:07:21,000 --> 00:07:25,000
Was it subtype or what did you say it was?

56
00:07:25,000 --> 00:07:29,000
I can't recall. Is it immediate type, intermediate type?

57
00:07:29,000 --> 00:07:33,000
It was intermediate type. And so that was flower.

58
00:07:33,000 --> 00:07:39,000
But if you go back to the old notebook.

59
00:07:39,000 --> 00:07:45,000
There are very few entries in the table that had intermediate type.

60
00:07:45,000 --> 00:07:49,000
Most of them were missing.

61
00:07:49,000 --> 00:07:56,000
The product type, or just type, was the one that was filled in the most.

62
00:07:56,000 --> 00:08:00,000
So intermediate type was missing most of the time.

63
00:08:00,000 --> 00:08:09,000
Like 85 percent of the entries were missing intermediate type.

64
00:08:09,000 --> 00:08:16,000
And I did try and go through and fill those in, but it just kind of didn't seem to.

65
00:08:16,000 --> 00:08:21,000
There's kind of no clear way forward with that.

66
00:08:21,000 --> 00:08:24,000
Just.

67
00:08:24,000 --> 00:08:30,000
Looking at some of the variables here. So.

68
00:08:30,000 --> 00:08:34,000
Were those just not end products by chance?

69
00:08:34,000 --> 00:08:37,000
The ones without intermediate type.

70
00:08:37,000 --> 00:08:47,000
No, end product also was missing a large number of intermediate types.

71
00:08:47,000 --> 00:08:51,000
I would have loved to have used intermediate type.

72
00:08:51,000 --> 00:08:54,000
As an input, but it just.

73
00:08:54,000 --> 00:09:00,000
There was kind of no clear way forward as to how to fill in those missing values.

74
00:09:00,000 --> 00:09:12,000
So let's power on for now, and then we'll start poking at this data here in just a second, because I've done some follow-up work with cannabinoid analysis.

75
00:09:12,000 --> 00:09:16,000
And then we can dive back into the sample type discussion.

76
00:09:16,000 --> 00:09:23,000
But please continue because no need to get hung up on this for the time being.

77
00:09:23,000 --> 00:09:37,000
Yeah, no, I'd be interested to find out, you know, what features are really viable and how we could fill in some of that missing data.

78
00:09:37,000 --> 00:09:44,000
So the next thing was end product, which was predicted to fail one and a half percent of the time.

79
00:09:44,000 --> 00:09:51,000
And it actually fails. In reality, it fails one percent of the time. So this was pretty close.

80
00:09:51,000 --> 00:10:03,000
Can I pause you there? So this is where savvy cannabis licensees will.

81
00:10:03,000 --> 00:10:22,000
Do some interesting analytics, predictive behavior, cost-benefit analysis. Because essentially what you can do is factor, say, a one percent chance that your products fail into your estimated costs.

82
00:10:22,000 --> 00:10:24,000
So that way you can budget correctly.

83
00:10:24,000 --> 00:10:33,000
So, for example, if you don't take that one percent chance of failure into consideration, well, in economics terms, that's a cost.

84
00:10:33,000 --> 00:10:42,000
Right. Because there's a one percent chance. And upon that chance, there's a cost.

85
00:10:42,000 --> 00:10:47,000
Right. You're going to have to expend to destroy the product.

86
00:10:47,000 --> 00:10:54,000
It's going to be a loss in inventory. So that's essentially an expected cost.

87
00:10:54,000 --> 00:11:04,000
So, long story short, I think this is something that businesses should take into consideration.

88
00:11:04,000 --> 00:11:12,000
So when you're starting your cultivation, I think you should factor into your costs the estimated probability that you may fail.

89
00:11:12,000 --> 00:11:22,000
Because what I've seen is cultivators, processors, they don't account for the probability of failure.

90
00:11:22,000 --> 00:11:33,000
And so then when something does fail, it becomes a disaster, because, you know, they didn't budget for that, they didn't plan for that.

91
00:11:33,000 --> 00:11:37,000
And now all of a sudden it's hard for them to make ends meet.

92
00:11:37,000 --> 00:11:47,000
So I just wanted to pause you there just to kind of drill that point home: that although it's just a one percent chance of failure,

93
00:11:47,000 --> 00:12:00,000
if you're producing large amounts of flower or concentrates, you know, one percent is not negligible.

94
00:12:00,000 --> 00:12:07,000
So something to pay attention to. Anyways, Charles just wanted to drive that point home real quick.

95
00:12:07,000 --> 00:12:11,000
OK, no, that's a good point.

96
00:12:11,000 --> 00:12:17,000
Especially in startup businesses. Yeah, there's nobody.

97
00:12:17,000 --> 00:12:21,000
Yeah. A lot of people don't factor in that something could go wrong.

98
00:12:21,000 --> 00:12:30,000
And when something does go wrong, then it is a disaster because they're operating on such a tight budget.

99
00:12:30,000 --> 00:12:33,000
And they haven't thought about these things.

100
00:12:33,000 --> 00:12:38,000
And you raise an interesting dimension, another factor, the size of the company.

101
00:12:38,000 --> 00:12:51,000
So this number may be biased down, due to, say, large cultivations; they have things under control, maybe for the most part.

102
00:12:51,000 --> 00:12:59,000
That's an assumption. But, you know, maybe they have their flower rooms well

103
00:12:59,000 --> 00:13:09,000
Quarantined and they're able to expend a bit more to keep their failure rate low. Plus, they're sending in tons of samples.

104
00:13:09,000 --> 00:13:17,000
So it may look like the failure rate is low. But if you do it conditional on business size,

105
00:13:17,000 --> 00:13:25,000
I wouldn't be surprised if smaller businesses may even have a slightly higher failure rate.

106
00:13:25,000 --> 00:13:39,000
That's interesting. And so is there a way to get the size of the cultivator out of the data, or maybe even the length of time they've been in business?

107
00:13:39,000 --> 00:13:44,000
Well, there's a couple of ways you could proxy the size of the business.

108
00:13:44,000 --> 00:13:57,000
So you could just do sales. There's sales data, so you can rank the companies by their sales.

109
00:13:57,000 --> 00:14:05,000
There are tiers. So technically there's tier one, tier two, and tier three.

110
00:14:05,000 --> 00:14:14,000
That data may be there. I suspect it is, but we'll want to make certain.

111
00:14:14,000 --> 00:14:20,000
So you could just look at the tiers. That would be the simplest.

112
00:14:20,000 --> 00:14:25,000
Do tier ones have a higher failure rate than tier three?
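That tier comparison is just a conditional average; assuming a per-sample frame with a (hypothetical) tier column, it's a one-line groupby:

```python
import pandas as pd

# Hypothetical per-sample data: the producer's license tier and whether
# the sample failed QA testing.
samples = pd.DataFrame({
    "tier": ["tier_1", "tier_1", "tier_2", "tier_2",
             "tier_3", "tier_3", "tier_3", "tier_3"],
    "failed": [1, 0, 0, 0, 1, 0, 0, 0],
})

# Conditional failure rate by tier: do the smaller (tier 1) producers
# fail more often than the larger (tier 3) ones?
failure_by_tier = samples.groupby("tier")["failed"].mean()
```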

113
00:14:25,000 --> 00:14:38,000
And then next, I would run a regression, if you can do that. I'm thinking if you just coded fail as zero-one,

114
00:14:38,000 --> 00:14:50,000
you may be able to run a regression, like a logistic regression, of failure on sales.

115
00:14:50,000 --> 00:14:56,000
So do companies that have higher sales, is their failure rate lower?
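A minimal sketch of that logistic regression, on simulated data where the failure probability genuinely declines with (log) sales; a negative fitted coefficient is what would support the "higher sales, lower failure rate" hypothesis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Simulated data: log sales per sample, and a 0/1 failure indicator
# whose probability shrinks as sales grow (the hypothesis at hand).
log_sales = rng.uniform(8, 14, size=2000)
p_fail = 1 / (1 + np.exp(-(2.0 - 0.3 * log_sales)))  # decreasing in sales
failed = (rng.random(2000) < p_fail).astype(int)

# Logistic regression of failure on log sales.
model = LogisticRegression().fit(log_sales.reshape(-1, 1), failed)
coef = model.coef_[0, 0]
```

On real data, the sales figures would come from ranking licensees by their reported sales rather than being simulated.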

116
00:14:56,000 --> 00:15:04,000
And my hypothesis is it may be, because, you know, if they have higher revenue,

117
00:15:04,000 --> 00:15:17,000
then they may be able to expend a bit more just on, you know, quality control, keeping the rooms clean.

118
00:15:17,000 --> 00:15:22,000
OK, checking your HVAC system. You know how it is.

119
00:15:22,000 --> 00:15:28,000
You know, sometimes smaller businesses may try to cut corners, try to keep costs low.

120
00:15:28,000 --> 00:15:35,000
And then that may or may not result in a higher failure rate. But it'd be interesting to see.

121
00:15:35,000 --> 00:15:42,000
OK, well, that's good. Yeah, that's something I can try out and try to move forward with.

122
00:15:42,000 --> 00:15:51,000
So one thing I did want to point out was that end product made up 68 percent of the training data,

123
00:15:51,000 --> 00:15:59,000
whereas marijuana only made up 12 percent. So the classifier is seeing end product more.

124
00:15:59,000 --> 00:16:04,000
So it's able to learn more about it, you know, learn about it better.

125
00:16:04,000 --> 00:16:13,000
So I believe that also is a factor in the numbers that came out.

126
00:16:13,000 --> 00:16:17,000
Because when you go down to.

127
00:16:17,000 --> 00:16:26,000
Harvest material, it predicted that the failure rate was 58 percent, when actually it was seven percent.

128
00:16:26,000 --> 00:16:34,000
And that made up 13 percent of the data. So again, it didn't see as much of it.

129
00:16:34,000 --> 00:16:39,000
And are these like microbial failures, or...?

130
00:16:39,000 --> 00:16:53,000
I guess we didn't really break that down, but I guess if it's harvest material, that must be microbes or it seems a little high.

131
00:16:53,000 --> 00:17:02,000
I don't know, because I only took data that you knew before the testing.

132
00:17:02,000 --> 00:17:11,000
Because if you started taking data that you knew about after the test happened, then you're peeking into the future.

133
00:17:11,000 --> 00:17:14,000
So.

134
00:17:14,000 --> 00:17:24,000
You know, you don't really know a lot, right? You know the type, you know the producer, you know the lab.

135
00:17:24,000 --> 00:17:30,000
And you could find some things out about the producer, but.

136
00:17:30,000 --> 00:17:35,000
Not as a predicting factor, but I'm asking,

137
00:17:35,000 --> 00:17:41,000
Like, as far as how you coded up failures.

138
00:17:41,000 --> 00:17:52,000
Was there like a variable that's used to mark overall failure status or are you going through each compound and comparing it to its limit?

139
00:17:52,000 --> 00:18:03,000
Well, in the lab data frame, the lab results data frame, there's actually a pass-fail column.

140
00:18:03,000 --> 00:18:11,000
Interesting. So this may get complicated, but.

141
00:18:11,000 --> 00:18:19,000
It could be interesting to start breaking the failure down by what's failing.

142
00:18:19,000 --> 00:18:23,000
So, for example.

143
00:18:23,000 --> 00:18:29,000
Concentrates are the only type of sample being tested for residual solvents.

144
00:18:29,000 --> 00:18:36,000
So it could be interesting to, say, look at what's the chance of

145
00:18:36,000 --> 00:18:41,000
a concentrate failing residual solvents.

146
00:18:41,000 --> 00:18:51,000
And then you could even break that down: what's the probability of it failing for each of the different solvents? So what's the probability that it fails for butane?

147
00:18:51,000 --> 00:18:55,000
What's the probability that it fails for propane?

148
00:18:55,000 --> 00:18:59,000
And so on and so forth.

149
00:18:59,000 --> 00:19:07,000
And then with flower, you may want to break it down by essentially the microbes.

150
00:19:07,000 --> 00:19:19,000
I don't remember them off the top of my head, but we've got the data set here.

151
00:19:19,000 --> 00:19:25,000
So they're testing for E. coli and salmonella. So.

152
00:19:25,000 --> 00:19:28,000
You could you could.

153
00:19:28,000 --> 00:19:39,000
Look at the failure rate for each of those. And the way you would do this is you would essentially have to calculate those failure rates

154
00:19:39,000 --> 00:19:55,000
Manually, by looking at, for example, the microbial pathogenic E. coli variable, and then comparing that to the state-mandated limit.

155
00:19:55,000 --> 00:20:04,000
Which may be 10,000 CFUs. So that's colony-forming units per gram.

156
00:20:04,000 --> 00:20:12,000
Don't quote me on that. So check the WAC.
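Flagging failures manually against a limit could look like the sketch below. The 10,000 CFU/g E. coli limit is the "don't quote me" number from above, an assumption for illustration only; the real limits live in the Washington Administrative Code (WAC):

```python
import pandas as pd

# Hypothetical limit, CFU per gram; check the WAC for the real value.
ECOLI_LIMIT_CFU_G = 10_000

# Made-up measured values for three samples.
results = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3"],
    "microbial_pathogenic_ecoli": [0, 12_500, 9_800],
})

# Flag failures manually by comparing the measurement to the limit,
# then take the mean to get a failure rate for this analyte.
results["ecoli_fail"] = results["microbial_pathogenic_ecoli"] > ECOLI_LIMIT_CFU_G
ecoli_failure_rate = results["ecoli_fail"].mean()
```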

157
00:20:12,000 --> 00:20:26,000
But long story short, we could potentially get more granular on the failure rates. So look at failure rates by sample type, for different things.

158
00:20:26,000 --> 00:20:30,000
Okay.

159
00:20:30,000 --> 00:20:34,000
Yeah, it's something I can work on.

160
00:20:34,000 --> 00:20:44,000
There's so many variables that you have to kind of narrow it down. So, for example, what's the probability of a residual.

161
00:20:44,000 --> 00:20:50,000
So what I think would be interesting is, okay, just look at concentrates.

162
00:20:50,000 --> 00:20:59,000
So you may have to look at the different types of concentrates and then try to estimate the probability that they fail for mycotoxins.

163
00:20:59,000 --> 00:21:10,000
Just broadly for any mycotoxin. I think that's ochratoxin and aflatoxin.

164
00:21:10,000 --> 00:21:18,000
So you could just say, okay, what's the probability that a concentrate fails for a mycotoxin?

165
00:21:18,000 --> 00:21:23,000
And then what's the probability that it fails for a residual solvent?

166
00:21:23,000 --> 00:21:31,000
Just any of them. So whether that's butane or propane. So that's a bit more complicated analysis.
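The "fails for any mycotoxin" versus "fails for any residual solvent" probabilities are row-wise any() operations over each family of per-analyte flag columns; the columns and values below are hypothetical:

```python
import pandas as pd

# Hypothetical per-analyte failure flags for four concentrate samples.
flags = pd.DataFrame({
    "fail_ochratoxin": [0, 0, 1, 0],
    "fail_aflatoxin":  [0, 0, 0, 0],
    "fail_butane":     [0, 1, 0, 0],
    "fail_propane":    [0, 1, 0, 0],
})

# "Fails for any mycotoxin" vs. "fails for any residual solvent":
# a row-wise any() over each family of columns, then a mean.
mycotoxin_cols = ["fail_ochratoxin", "fail_aflatoxin"]
solvent_cols = ["fail_butane", "fail_propane"]
p_mycotoxin = flags[mycotoxin_cols].any(axis=1).mean()
p_solvent = flags[solvent_cols].any(axis=1).mean()
```

Comparing the two rates is the "which is your bigger risk" question: input quality versus purging process.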

167
00:21:31,000 --> 00:21:41,000
But once again, it's just going to provide more information to processors, so that way they can know, okay, what's your bigger risk?

168
00:21:41,000 --> 00:21:55,000
Are mycotoxins your bigger risk? And then that would essentially come from biologically dirty input. So just kind of dirty input.

169
00:21:55,000 --> 00:22:02,000
So that would be like, okay, check your input if your mycotoxin risk is high.

170
00:22:02,000 --> 00:22:11,000
And then if your residual solvent risk is high, then that means you want to address your process.

171
00:22:11,000 --> 00:22:22,000
So you want to address how you are getting rid of your solvent.

172
00:22:22,000 --> 00:22:29,000
How are you purging your solvent from your products, essentially?

173
00:22:29,000 --> 00:22:40,000
So sorry to go down that rabbit hole. But in a way, there's endless analysis there, right?

174
00:22:40,000 --> 00:22:45,000
Because you can break that down a lot.

175
00:22:45,000 --> 00:22:49,000
Right. Okay. Well, that's something to get into.

176
00:22:49,000 --> 00:22:59,000
And I definitely have a lot of functions now to deal with this table, cleaning it up and pulling stuff out of it.

177
00:22:59,000 --> 00:23:08,000
Good. Good. Yeah. Well, you want to bring us home here, Charles, and then we can potentially get our hands on the data real quick.

178
00:23:08,000 --> 00:23:12,000
And I'll show you just you've done a lot more complex analysis than I have.

179
00:23:12,000 --> 00:23:22,000
But I've just put together a couple histograms of some of the cannabinoids, and we can just look at the variables and talk about the data.

180
00:23:22,000 --> 00:23:33,000
That's what we're here for. But anyways, would you bring us home here with your main takeaways and the rest of your analysis?

181
00:23:33,000 --> 00:23:44,000
Okay. Yeah. So, you know, intermediate product was predicted to fail like 31 percent of the time, but in reality, it only fails like two percent.

182
00:23:44,000 --> 00:23:52,000
But it only makes up seven percent of the data. So I think, you know,

183
00:23:52,000 --> 00:24:03,000
the reason that you get the best predictions from end product is that it makes up the majority of the data.

184
00:24:03,000 --> 00:24:17,000
So the classifier has enough data to learn something about it, and it doesn't have enough data about the other products to learn anything or to learn enough to make accurate predictions.

185
00:24:17,000 --> 00:24:26,000
I did spend most of the week working on trying to come up with a balanced data set.

186
00:24:26,000 --> 00:24:40,000
And I tried several different things with oversampling and even breaking the data up into passing and failing.

187
00:24:40,000 --> 00:24:50,000
And then oversampling that. But what I'm finding, you know, I'm kind of finding this:
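Random oversampling of the failing class, as described, can be done with plain pandas by resampling the minority rows with replacement up to the majority-class size (`imblearn`'s RandomOverSampler does the same thing); the 95/5 split below is toy data:

```python
import pandas as pd

# Toy imbalanced data: far more passing samples than failing ones.
df = pd.DataFrame({"failed": [0] * 95 + [1] * 5, "x": range(100)})

# Random oversampling: resample the failing rows with replacement until
# the two classes are the same size, then shuffle the combined frame.
fails = df[df["failed"] == 1]
passes = df[df["failed"] == 0]
oversampled = pd.concat([
    passes,
    fails.sample(n=len(passes), replace=True, random_state=0),
]).sample(frac=1, random_state=0)  # shuffle
```

Note that oversampling only duplicates existing failing rows; as the discussion concludes, it can't conjure signal that isn't in the data.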

188
00:24:50,000 --> 00:24:57,000
I also found the same kind of thing with the hemp data, which was that I'm not sure that there's anything there to predict.

189
00:24:57,000 --> 00:25:04,000
Right. These could just be sort of random occurrences or.

190
00:25:04,000 --> 00:25:13,000
Yeah, I mean, I don't know if there's actually anything that's really predictable happening.

191
00:25:13,000 --> 00:25:20,000
I can kind of keep going down this path a little bit further, but.

192
00:25:20,000 --> 00:25:24,000
Yes, I'm not. Go ahead.

193
00:25:24,000 --> 00:25:37,000
I think, well, two things. So you definitely could be right. So we may have to think about another dimension of the data

194
00:25:37,000 --> 00:25:40,000
that we could get some good insights from.

195
00:25:40,000 --> 00:25:54,000
My recommendation would be to maybe even put prediction aside and just look at some conditional averages for the time being, but I would drill down a little further.

196
00:25:54,000 --> 00:25:59,000
So essentially what I would do is I would only look at end products.

197
00:25:59,000 --> 00:26:13,000
So let's take a sample type. So, well, like a cluster of sample types. So I would say concentrates and I've got the list pulled up here.

198
00:26:13,000 --> 00:26:18,000
So concentrates.

199
00:26:18,000 --> 00:26:29,000
Could be hydrocarbon concentrates, concentrates for inhalation, non solvent based concentrates.

200
00:26:29,000 --> 00:26:33,000
I would leave the mixes out.

201
00:26:33,000 --> 00:26:38,000
And then you've got CO2 concentrate.

202
00:26:38,000 --> 00:26:49,000
And then you may or may not want to include the food grade solvent concentrates.

203
00:26:49,000 --> 00:27:03,000
Or the non-solvent-based concentrates, right. Because with the non-solvent-based concentrates, there's going to be like what they call bubble hash, where you would use essentially water as your solvent.

204
00:27:03,000 --> 00:27:10,000
So you wouldn't expect there to be butane in those concentrates.

205
00:27:10,000 --> 00:27:16,000
So, long story short, you may want to pick a cluster of concentrates.

206
00:27:16,000 --> 00:27:26,000
And then I would look at, or just do one: just look at hydrocarbon concentrates. So just look at end product hydrocarbon concentrates.

207
00:27:26,000 --> 00:27:37,000
What's the probability of failing for residual solvents, and then potentially for the different residual solvents?

208
00:27:37,000 --> 00:27:48,000
And I think once you get that granular, that could actually provide useful insight, as simply a conditional average, to a processor.

209
00:27:48,000 --> 00:28:07,000
So then you would just tell processors, hey, historically, hydrocarbon concentrates have an X percent chance of failing for residual solvents.

210
00:28:07,000 --> 00:28:10,000
Similarly, you could do it with flower.

211
00:28:10,000 --> 00:28:15,000
And there your principal variable would be your microbes.

212
00:28:15,000 --> 00:28:33,000
So if you grow flower, what's the probability that your end product, your end flower, will fail for microbial contaminants?

213
00:28:33,000 --> 00:28:41,000
And it may be hard to predict, right, because those may be like one percent or less.

214
00:28:41,000 --> 00:28:54,000
But I think just knowing what those percentages are could be useful for people planning their their cost structure, essentially.

215
00:28:54,000 --> 00:28:57,000
Right. Okay.

216
00:28:57,000 --> 00:29:06,000
You know, and I guess the other thing to take away from this is, on the surface, this classifier looked really good.

217
00:29:06,000 --> 00:29:12,000
But after we dug into it, you know, it really wasn't much better than a dummy classifier.

218
00:29:12,000 --> 00:29:17,000
So you shouldn't trust these things outright.

219
00:29:17,000 --> 00:29:20,000
And that's how you could potentially do some robustness checks.

220
00:29:20,000 --> 00:29:23,000
So just throw in a dummy classifier.
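Throwing in a dummy classifier as a robustness check is a one-liner with scikit-learn; any real model should have to beat this majority-class baseline, which already scores high accuracy on data with a 12 percent failure rate. The data below is simulated with an uninformative feature:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Toy data: roughly a 12% failure rate and a feature with no signal.
X = rng.random((1000, 1))
y = (rng.random(1000) < 0.12).astype(int)

# Baseline that predicts the majority class (pass) for every sample.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = accuracy_score(y, baseline.predict(X))
```

If the tuned model's metrics don't clearly beat this baseline (or its per-type failure predictions don't beat always-predict-the-type's-base-rate), it hasn't learned anything.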

221
00:29:23,000 --> 00:29:31,000
And so I think it's fruitful, interesting analysis, right, because.

222
00:29:31,000 --> 00:29:36,000
It's better to do it than to not do it right. So I think we've learned a lot.

223
00:29:36,000 --> 00:29:39,000
So it's.

224
00:29:39,000 --> 00:29:43,000
OK, Charles. I'll go ahead and start presenting.

225
00:29:43,000 --> 00:29:48,000
And that way we can see some of these variables that we've been talking about.

226
00:29:48,000 --> 00:29:52,000
OK. So.

227
00:29:52,000 --> 00:30:03,000
Just to show you what I've been looking at over here. So these are essentially the intermediate types that I've identified.

228
00:30:03,000 --> 00:30:12,000
Just to kind of give you a bit of a background before we just dive straight into that.

229
00:30:12,000 --> 00:30:22,000
So. Essentially, right, we're working with this Washington state data.

230
00:30:22,000 --> 00:30:25,000
Primarily looking at the cannabinoids.

231
00:30:25,000 --> 00:30:31,000
Well, the lab results in general, not necessarily the cannabinoids.

232
00:30:31,000 --> 00:30:36,000
And so. I essentially wanted to do a quick analysis here.

233
00:30:36,000 --> 00:30:43,000
And so. There's been a lot of talk about Delta 8 THC.

234
00:30:43,000 --> 00:30:48,000
So I just wanted to take a quick look at Delta 8 THC.

235
00:30:48,000 --> 00:30:58,000
And then I was just perusing the literature to see what's the latest research being done.

236
00:30:58,000 --> 00:31:02,000
Found an article here.

237
00:31:02,000 --> 00:31:08,000
Let's see when was this published? Published not too long ago.

238
00:31:08,000 --> 00:31:12,000
Oh, yeah. So it was just published July 10th.

239
00:31:12,000 --> 00:31:18,000
And so essentially what they did was they just looked at edibles in Jamaica.

240
00:31:18,000 --> 00:31:22,000
And so these.

241
00:31:22,000 --> 00:31:25,000
Not necessarily.

242
00:31:25,000 --> 00:31:33,000
From stores. So some of these were confiscated from high school students.

243
00:31:33,000 --> 00:31:39,000
So these were just. I think they only had around 50 or so.

244
00:31:39,000 --> 00:31:44,000
So they basically had around 50 edibles that they came by.

245
00:31:44,000 --> 00:31:54,000
And they were looking at the THC and CBD.

246
00:31:54,000 --> 00:32:01,000
Found in the edibles. And so they're looking at the THC CBD ratio.
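Replicating the paper's THC:CBD ratio on the Washington data would boil down to a column-wise ratio; the column names and values below are hypothetical:

```python
import pandas as pd

# Hypothetical edible lab results (column names are made up).
edibles = pd.DataFrame({
    "product": ["gummy", "brownie", "cookie"],
    "thc_mg": [10.0, 48.0, 5.0],
    "cbd_mg": [2.0, 1.0, 25.0],
})

# The THC:CBD ratio studied in the edibles paper, as a per-row ratio;
# ratios above 1 are THC-dominant products.
edibles["thc_cbd_ratio"] = edibles["thc_mg"] / edibles["cbd_mg"]
```

On the real data set, zero or missing CBD values would need handling before dividing.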

247
00:32:01,000 --> 00:32:05,000
And so I thought, well, you know, they only have 50 edibles.

248
00:32:05,000 --> 00:32:11,000
We probably have a lot more edibles in the lab results data set.

249
00:32:11,000 --> 00:32:23,000
So why don't we try to essentially replicate their analysis and see what we find with edibles here in Washington state.

250
00:32:23,000 --> 00:32:28,000
So that's just a bit of background.

251
00:32:28,000 --> 00:32:36,000
About where I'm coming from, just to build on Charles's analysis.

252
00:32:36,000 --> 00:32:40,000
OK.

253
00:32:40,000 --> 00:32:48,000
So let's see if we can't.

254
00:32:48,000 --> 00:32:54,000
Get a terminal opened here.

255
00:32:54,000 --> 00:32:59,000
So give me about 30 seconds to read in the data.

256
00:32:59,000 --> 00:33:05,000
So we're working with these lab results from Washington state.

257
00:33:05,000 --> 00:33:09,000
So we've got about two gigabytes of.

258
00:33:09,000 --> 00:33:19,000
Lab results here and just for a refresher, this is the same data set that we've been working with for some time now.

259
00:33:19,000 --> 00:33:27,000
And then I'll be working to get an updated copy of this data set to the most recent time period.

260
00:33:27,000 --> 00:33:34,000
So that way we can refresh our analysis.

261
00:33:34,000 --> 00:33:40,000
OK, looks like we've read in the data here. So first things first.

262
00:33:40,000 --> 00:33:44,000
Just want to look at the data.

263
00:33:44,000 --> 00:33:48,000
So.

264
00:33:48,000 --> 00:33:57,000
We have almost two million observations.

265
00:33:57,000 --> 00:34:01,000
To go ahead and show you.

266
00:34:01,000 --> 00:34:05,000
About the sample types here. So first off.

267
00:34:05,000 --> 00:34:14,000
Just to show you the data points we're working with.

268
00:34:14,000 --> 00:34:17,000
There's a lot of.

269
00:34:17,000 --> 00:34:22,000
Analytes, so pesticides, solvents.

270
00:34:22,000 --> 00:34:29,000
Microbes, terpenes (well, there's not actually terpenes), cannabinoids.

271
00:34:29,000 --> 00:34:35,000
OK, so the main identifiers are essentially type.

272
00:34:35,000 --> 00:34:48,000
And intermediate type and to show you the guidebook links up here.

273
00:34:48,000 --> 00:34:53,000
OK, so returning to our type discussion.

274
00:34:53,000 --> 00:34:58,000
So essentially there's several broad types.

275
00:34:58,000 --> 00:35:02,000
So.

276
00:35:02,000 --> 00:35:06,000
The end products are what end up on the shelves.

277
00:35:06,000 --> 00:35:13,000
Intermediate products could potentially be processed into other.

278
00:35:13,000 --> 00:35:18,000
End products or become end products themselves.

279
00:35:18,000 --> 00:35:24,000
So.

280
00:35:24,000 --> 00:35:30,000
What I find most useful for determining what the sample is.

281
00:35:30,000 --> 00:35:34,000
Is essentially its intermediate type.

282
00:35:34,000 --> 00:35:40,000
And so this is technically the subcategory of inventory type.

283
00:35:40,000 --> 00:35:45,000
And so it's conditional on the type.

284
00:35:45,000 --> 00:35:52,000
So.

285
00:35:52,000 --> 00:35:57,000
This is where we start to lose a little bit of the rhyme and reason.

286
00:35:57,000 --> 00:35:59,000
So for example.

287
00:35:59,000 --> 00:36:09,000
The concentrates are always your intermediate products.

288
00:36:09,000 --> 00:36:15,000
Your edibles are always.

289
00:36:15,000 --> 00:36:24,000
End products it looks like.

290
00:36:24,000 --> 00:36:31,000
So it's a little bit of a mess, but essentially if you just look at.

291
00:36:31,000 --> 00:36:37,000
Just the intermediate types.

292
00:36:37,000 --> 00:36:45,000
It gives you a decent understanding of what the product actually is.

293
00:36:45,000 --> 00:36:47,000
For example.

294
00:36:47,000 --> 00:36:55,000
If you just look at the intermediate types for hydrocarbon concentrates, you'll just get hydrocarbons.

295
00:36:55,000 --> 00:37:08,000
Or, it's a little bit tricky. I was looking at the data this morning, and it doesn't appear at first glance like there's a big distinction between flower and flower lots.

296
00:37:08,000 --> 00:37:21,000
I've got a sneaking suspicion that flower is intended for processing and flower lots are intended directly for sale.

297
00:37:21,000 --> 00:37:29,000
In my analysis here, I just combined flower and flower lots and called that flower data.

298
00:37:29,000 --> 00:37:34,000
And so for now, I'll just be looking at flower data.

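For readers following along in code, the relabeling Keegan describes here can be sketched roughly as below. The column name `intermediate_type` and the label spellings are assumptions for illustration, not the dataset's actual schema, so check your copy of the Washington State data for the exact values.

```python
import pandas as pd

def combine_flower_types(df, type_col="intermediate_type"):
    """Collapse 'flower_lots' into the 'flower' label so both can be
    treated as one category, as done in the discussion above."""
    out = df.copy()
    out[type_col] = out[type_col].replace({"flower_lots": "flower"})
    return out

# Hypothetical example rows:
samples = pd.DataFrame({
    "intermediate_type": ["flower", "flower_lots", "hydrocarbon_concentrate"],
})
flower = combine_flower_types(samples)
flower_only = flower[flower["intermediate_type"] == "flower"]
```

This keeps the distinction question open: once the LCB clarifies the difference, the mapping dict is the only thing that needs to change.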
299
00:37:34,000 --> 00:37:40,000
So flower lots is.

300
00:37:40,000 --> 00:37:47,000
An intermediate type for end product and flower is an intermediate type for marijuana.

301
00:37:47,000 --> 00:37:56,000
If you look at my version one of that notebook, it kind of breaks that down by percentage.

302
00:37:56,000 --> 00:38:04,000
And actually, flower is the only subtype for.

303
00:38:04,000 --> 00:38:10,000
For marijuana. There's a bunch of others listed, but.

304
00:38:10,000 --> 00:38:14,000
Flower is at 100%.

305
00:38:14,000 --> 00:38:19,000
Exactly. And so it may even be worth asking the LCB themselves.

306
00:38:19,000 --> 00:38:24,000
But essentially, I've got a sneaking suspicion.

307
00:38:24,000 --> 00:38:35,000
One is designated for stores, and the other could potentially end up at stores or could potentially end up at a processor.

308
00:38:35,000 --> 00:38:39,000
It looks a lot like they treat.

309
00:38:39,000 --> 00:38:44,000
The two types the same. So.

310
00:38:44,000 --> 00:38:50,000
Once again, just sort of deferring to what I brought up in the last meetup, where this is not.

311
00:38:50,000 --> 00:38:56,000
This is not a bulletproof analysis here. So if you were doing.

312
00:38:56,000 --> 00:39:06,000
You know, research on your own or, you know, commissioned research, you're going to want to do a lot more research and dive in and answer these questions. So.

313
00:39:06,000 --> 00:39:17,000
For example, wouldn't hurt to just email the Washington State LCB and ask, OK, what exactly is the difference between flower and flower lots?

314
00:39:17,000 --> 00:39:20,000
They may or may not have.

315
00:39:20,000 --> 00:39:25,000
An answer, but.

316
00:39:25,000 --> 00:39:36,000
For now, we're just doing expedient analysis, but it's worth reading up a bit more because, like I said, I'm not 100% certain.

317
00:39:36,000 --> 00:39:40,000
And I need to become certain. So.

318
00:39:40,000 --> 00:39:48,000
So, so long story short, if you have any insights, definitely let everybody know.

319
00:39:48,000 --> 00:39:52,000
Just keep powering on here.

320
00:39:52,000 --> 00:40:00,000
I'll just be looking at flower and flower lots and will worry about the distinction later.

321
00:40:00,000 --> 00:40:02,000
Out of the samples.

322
00:40:02,000 --> 00:40:17,000
Out of the almost 2 million, there are about 239,000 flower lots. So I think, Charles, this may have been what you were hinting at, where.

323
00:40:17,000 --> 00:40:22,000
There's a lot of data there that may not necessarily.

324
00:40:22,000 --> 00:40:27,000
Have lab results. Is that the case? Is that what you're finding or.

325
00:40:27,000 --> 00:40:35,000
There's lab results, but like the intermediate type isn't filled in.

326
00:40:35,000 --> 00:40:40,000
Yes, and so.

327
00:40:40,000 --> 00:40:45,000
That's it. That's interesting. So we need to dig in and find out more. So I wonder if.

328
00:40:45,000 --> 00:40:53,000
Those are observations that were just missing an intermediate type, or if those are.

329
00:40:53,000 --> 00:41:01,000
Other types of samples, so I think I think there's more discovery to be had here.

330
00:41:01,000 --> 00:41:08,000
But just to keep powering through, not to get bogged down, essentially I'll run through this analysis.

331
00:41:08,000 --> 00:41:20,000
Prefacing it with the fact that we still need to find out a bit more about the variables. So this analysis should be repeated once we've learned a bit more about the data.

332
00:41:20,000 --> 00:41:27,000
But just to demonstrate some of the techniques we can use.

333
00:41:27,000 --> 00:41:36,000
So first research question, I was just curious about Delta 8 THC because.

334
00:41:36,000 --> 00:41:41,000
This actually came up in a discussion the other day at the.

335
00:41:41,000 --> 00:41:50,000
Washington State Liquor and Cannabis Board deliberative dialogue. They were talking about Delta 8 and synthetic cannabinoids.

336
00:41:50,000 --> 00:41:59,000
I was curious. Okay, so what does the presence of Delta 8 THC look like in an ordinary flower?

337
00:41:59,000 --> 00:42:12,000
So out of the almost 240,000 flower samples, a little less than 2000 had Delta 8 THC.

338
00:42:12,000 --> 00:42:27,000
If you look at the actual percentage, it's less than 1%. So about 0.75% of all the flower samples tested had Delta 8 THC.

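The detection-rate figure quoted here (roughly 1,800 out of 239,000, or about 0.75%) can be computed with a one-liner along these lines. The column name `delta_8_thc` is an assumption; adjust it to the dataset's actual field.

```python
import pandas as pd

def delta8_share(df, col="delta_8_thc"):
    """Fraction of samples with a detected (non-null, positive)
    Delta-8 THC result. Nulls are treated as non-detections."""
    detected = df[col].fillna(0) > 0
    return detected.mean()

# Hypothetical illustration: ~1,800 detections out of ~239,000 flower
# samples would give a share of roughly 0.0075, i.e. about 0.75%.
```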
339
00:42:27,000 --> 00:42:35,000
So it appears to be a rare compound.

340
00:42:35,000 --> 00:42:39,000
And essentially.

341
00:42:39,000 --> 00:42:44,000
I've restricted.

342
00:42:44,000 --> 00:42:56,000
The sample to exclude outliers because so, for example, let's just take a quick description.

343
00:42:56,000 --> 00:42:58,000
So.

344
00:42:58,000 --> 00:43:14,000
You'll notice, okay, the mean is around 1%. However, we've got an observation in there that's coded at 64% Delta 8 THC and.

345
00:43:14,000 --> 00:43:29,000
I don't think that's the case, since it's a flower sample, so I think there may be either miscoding or some sort of outlier in the data.

346
00:43:29,000 --> 00:43:33,000
So I gave it a generous.

347
00:43:33,000 --> 00:43:37,000
5% exclusion of outliers.

348
00:43:37,000 --> 00:43:42,000
So I'm restricting it to the bottom 95 percentile.

349
00:43:42,000 --> 00:43:49,000
And just to look at that data.

350
00:43:49,000 --> 00:43:55,000
Here would be your distribution or your density.

351
00:43:55,000 --> 00:43:59,000
Of Delta 8 THC and flower.

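The percentile-based trimming described above, restricting to the bottom 95th percentile before plotting the density, can be sketched like this. Column name and the plotting call are illustrative; the density plot itself would need matplotlib installed.

```python
import pandas as pd

def trim_to_percentile(series, pct=0.95):
    """Keep observations at or below the given quantile, dropping
    suspect outliers such as a flower sample coded at 64% Delta-8 THC."""
    cutoff = series.quantile(pct)
    return series[series <= cutoff]

# Plotting sketch (requires matplotlib; column name is an assumption):
# trimmed = trim_to_percentile(df["delta_8_thc"].dropna())
# trimmed.plot.density()
```

Note that 5% is a generous cut; a tighter alternative is to filter on a domain-motivated ceiling (e.g. a maximum plausible flower concentration) rather than a blanket percentile.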
352
00:43:59,000 --> 00:44:01,000
So.

353
00:44:01,000 --> 00:44:19,000
And then this is actually of the flower that contains Delta 8 THC. So this chart would have a lot more zeros. If we included all the flower that didn't have Delta 8 THC.

354
00:44:19,000 --> 00:44:21,000
So.

355
00:44:21,000 --> 00:44:35,000
I think it's interesting to observe. So it is just this rare compound, and there could be more work to be done there. For example, you know, what are some of the.

356
00:44:35,000 --> 00:44:39,000
This could be a huge data dump.

357
00:44:39,000 --> 00:44:42,000
But.

358
00:44:42,000 --> 00:44:52,000
You know, what are some of the strains.

359
00:44:52,000 --> 00:44:56,000
I thought there was a strain variable.

360
00:44:56,000 --> 00:45:03,000
But anyways, what are some of the strains that, you know, are more common.

361
00:45:03,000 --> 00:45:07,000
Maybe it's product.

362
00:45:07,000 --> 00:45:12,000
Now to get the strain, you have to merge it with the batch.

363
00:45:12,000 --> 00:45:16,000
And then merge that with the strain.

364
00:45:16,000 --> 00:45:22,000
That's right. So you have to do a bit of leg work to attach those data points.

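The two-step merge Charles describes, lab results to batches, then batches to strains, looks roughly like the sketch below. The key names `batch_id` and `strain_id` are assumptions; the real Washington State tables may use different identifiers, and since many rows are missing strain IDs, left joins preserve those rows rather than dropping them.

```python
import pandas as pd

def attach_strains(results, batches, strains):
    """Attach strain information to lab results via the batch table.
    Left joins keep results whose batch or strain ID is missing."""
    merged = results.merge(batches, on="batch_id", how="left")
    return merged.merge(strains, on="strain_id", how="left")
```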
365
00:45:22,000 --> 00:45:38,000
I think that could be fruitful legwork. So, for example, what are the sample types that have Delta 8 THC? Because, as we saw, there's only about 1,800 of them.

366
00:45:38,000 --> 00:45:57,000
And there could be duplicates, right? So there could be one strain that's grown many, many times. So I'm curious: what are these strains being grown that are producing Delta 8 THC? And.

367
00:45:57,000 --> 00:46:05,000
There's also a surprisingly large number of entries that are missing the strain ID.

368
00:46:05,000 --> 00:46:12,000
Exactly. So this could be where it's just essentially.

369
00:46:12,000 --> 00:46:18,000
Not the best data entry, so.

370
00:46:18,000 --> 00:46:26,000
Different licensees may not be consistent about how they're entering in strains and types, although.

371
00:46:26,000 --> 00:46:30,000
You know, the system, I'm sure, would like things to be nice and consistent.

372
00:46:30,000 --> 00:46:35,000
So.

373
00:46:35,000 --> 00:46:41,000
That's the reality of working with, you know, real data:

374
00:46:41,000 --> 00:46:49,000
It can be a mess, right?

375
00:46:49,000 --> 00:47:02,000
So we need to be careful with the data points, because, as somebody pointed out, there's an acronym, GIGO: garbage in, garbage out. So.

376
00:47:02,000 --> 00:47:10,000
You have to be real careful about, you know, what data points we're looking at here and so.

377
00:47:10,000 --> 00:47:26,000
You know, you may have to suffer. In this case, I'm excluding the outliers because I just personally don't believe that there's a sample that actually has 64% Delta 8 THC.

378
00:47:26,000 --> 00:47:29,000
So.

379
00:47:29,000 --> 00:47:34,000
That's the reality of the situation.

380
00:47:34,000 --> 00:47:41,000
But just to keep moving on, unless you've got some more observations Charles or Brooke.

381
00:47:41,000 --> 00:47:43,000
So.

382
00:47:43,000 --> 00:47:47,000
Just do. I don't have any questions. Okay.

383
00:47:47,000 --> 00:47:55,000
So that was just a little look I wanted to do at Delta 8 THC, just because there's been a bit of noise about that.

384
00:47:55,000 --> 00:48:15,000
Next, just to keep doing some cannabinoid analysis, I thought, okay, we could reproduce this histogram. This is simply a histogram of the THC to CBD ratio of the solid edibles in Jamaica.

385
00:48:15,000 --> 00:48:18,000
A sample of solid edibles in Jamaica.

386
00:48:18,000 --> 00:48:23,000
So we can do that in Washington State.

387
00:48:23,000 --> 00:48:25,000
So.

388
00:48:25,000 --> 00:48:36,000
Going back to the sample types.

389
00:48:36,000 --> 00:48:43,000
You'll see that there are a handful of edible classifications.

390
00:48:43,000 --> 00:48:55,000
There's topicals, there's capsules, there's solid edibles, there's tinctures, there's cooking mediums, there's liquid edibles.

391
00:48:55,000 --> 00:49:04,000
There's transdermal patches. So there's, there's a handful of non-traditional cannabis products.

392
00:49:04,000 --> 00:49:10,000
Just for the sake of simplicity.

393
00:49:10,000 --> 00:49:15,000
I'm just doing analysis here on solid edibles.

394
00:49:15,000 --> 00:49:33,000
Because I believe if you look at the breakdown of the edibles that they are looking at in Jamaica, you'll see they've got baked goods and candies.

395
00:49:33,000 --> 00:49:39,000
They do have some beverages. So we may want to include liquid edibles.

396
00:49:39,000 --> 00:49:46,000
However, for the most part, it looks like baked goods, chocolates.

397
00:49:46,000 --> 00:49:57,000
So in our data, solid edible is 37% of the intermediate types for end product.

398
00:49:57,000 --> 00:50:02,000
And liquid edible is only 10%.

399
00:50:02,000 --> 00:50:18,000
So honestly, let's do this now. We've got time. So let's do this analysis real quick with solid edibles, liquid edibles, and then solid and liquid edibles.

400
00:50:18,000 --> 00:50:20,000
So.

401
00:50:20,000 --> 00:50:26,000
Just to do it real quick, just with solid edibles.

402
00:50:26,000 --> 00:50:34,000
So I saw that there were about 9000 solid edible observations.

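Reproducing the Jamaica-style histogram on the ~9,000 Washington solid-edible samples amounts to computing a THC to CBD ratio and plotting it. A minimal sketch, assuming hypothetical column names `thc` and `cbd`:

```python
import pandas as pd

def thc_cbd_ratio(df, thc_col="thc", cbd_col="cbd"):
    """Compute the THC:CBD ratio, dropping rows where CBD is zero or
    missing to avoid division by zero. Column names are assumptions."""
    valid = df[df[cbd_col].notna() & (df[cbd_col] > 0)]
    return valid[thc_col] / valid[cbd_col]

# Histogram sketch (requires matplotlib):
# ratios = thc_cbd_ratio(solid_edibles)
# ratios.plot.hist(bins=50)
```

Dropping zero-CBD rows is a design choice worth flagging: high-THC, zero-CBD edibles are common, so an alternative is to report those separately rather than silently exclude them.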
403
00:50:34,000 --> 00:50:44,000
And so this is what's so cool about our analysis here in the Cannabis Data Science Meetup Group is this paper.

404
00:50:44,000 --> 00:50:47,000
Right, so.

405
00:50:47,000 --> 00:50:48,000
Right.

406
00:50:48,000 --> 00:50:51,000
These authors.

407
00:50:51,000 --> 00:50:57,000
Have their article published in the Journal of Cannabis Research.

408
00:50:57,000 --> 00:51:06,000
And it is incredibly interesting because we want to know about the cannabis in Jamaica.

409
00:51:06,000 --> 00:51:11,000
However, got to keep in mind that.

410
00:51:11,000 --> 00:51:17,000
Right, actually they don't even have 50. They've got 45.

411
00:51:17,000 --> 00:51:23,000
Edibles collected over a four year period. And so.

412
00:51:23,000 --> 00:51:33,000
I mean, for starters, you could argue that this data is already dated. I mean, it's 2021.

413
00:51:33,000 --> 00:51:47,000
So is it even relevant to compare edibles in 2021 to edibles in 2015?

414
00:51:47,000 --> 00:51:55,000
Maybe, you know, maybe not. And so I think that's what's cool about our analysis here is.

415
00:51:55,000 --> 00:52:03,000
You know, a lot of the times your access to data is limited, right? So they're just looking at 45 samples and.

416
00:52:03,000 --> 00:52:06,000
You know, they're good.

417
00:52:06,000 --> 00:52:14,000
That's fruitful, right? Because you still get a breakdown of the THC to CBD ratio.

418
00:52:14,000 --> 00:52:21,000
We can collect data in Washington State. It's public. It's free. Anybody can collect it.

419
00:52:21,000 --> 00:52:31,000
We've collected it and we have 9000 solid edible samples between.

420
00:52:31,000 --> 00:52:48,000
And so we can even find our date range here. So.

421
00:52:48,000 --> 00:52:53,000
I should have just done this on.

422
00:52:53,000 --> 00:53:02,000
Wonder if I can.

423
00:53:02,000 --> 00:53:16,000
Well, luckily I have this over here too.

424
00:53:16,000 --> 00:53:19,000
OK.

