1
00:00:00,000 --> 00:00:12,600
Sorry about that.

2
00:00:12,600 --> 00:00:16,000
Yeah.

3
00:00:16,000 --> 00:00:19,240
Well, welcome everyone to the cannabis data science meet up group.

4
00:00:19,240 --> 00:00:29,200
This is a group where we meet at least once a week to talk about cannabis data and how we can apply data analytics.

5
00:00:29,200 --> 00:00:32,560
Here's a guest, Heather, who joins us regularly.

6
00:00:32,560 --> 00:00:34,920
And so my name is Keegan.

7
00:00:34,920 --> 00:00:40,760
I founded a company, Canlytics, and we're here to make cannabis testing simple and easy.

8
00:00:40,760 --> 00:00:46,080
And while doing so, we can provide rich analytics to everyone in the industry.

9
00:00:46,080 --> 00:00:48,320
So that's my comparative advantage.

10
00:00:48,320 --> 00:00:50,920
So I'm always just here to share.

11
00:00:50,920 --> 00:00:57,040
And I'll let Charles go first. Would you mind introducing yourself real quick?

12
00:00:57,040 --> 00:00:58,480
Hi, I'm Charles.

13
00:00:58,480 --> 00:01:07,480
I have 27 years of software engineering experience, and I've been transitioning into the data science area.

14
00:01:07,480 --> 00:01:15,520
I have experience with TensorFlow and machine learning, genetic algorithms.

15
00:01:15,520 --> 00:01:25,480
And this has been a great group to get some practice and work with some interesting data.

16
00:01:25,480 --> 00:01:31,800
Definitely. And then, everyone, if you want to go, I'll go around.

17
00:01:31,800 --> 00:01:35,680
If you want to share a bit about yourself, feel free to jump in.

18
00:01:35,680 --> 00:01:39,280
So I guess next is Alan.

19
00:01:39,280 --> 00:01:42,160
If you're interested, would you like to introduce yourself?

20
00:01:42,160 --> 00:01:54,920
Yeah, hi. So I'm Alan, and I'm transitioning into the field of data science from being a science educator for 15 years, and I graduated from the Galvanize Data Science Bootcamp.

21
00:01:54,920 --> 00:02:02,160
And yeah, working on improving my skills and trying to get a job.

22
00:02:02,160 --> 00:02:10,360
Excellent. We have a pretty regular guest, Paul, who is on vacation this week and will be back next week.

23
00:02:10,360 --> 00:02:14,520
And he's also getting a degree in data science.

24
00:02:14,520 --> 00:02:25,840
And so what we found is that cannabis data is some of the richest, most interesting data out there,

25
00:02:25,840 --> 00:02:33,160
because, you know, especially with the traceability systems, you have such granular, high frequency data.

26
00:02:33,160 --> 00:02:40,160
So it's fun to do analysis on regardless of your background.

27
00:02:40,160 --> 00:02:44,600
Matt, would you, if you're interested, would you want to introduce yourself real quick?

28
00:02:44,600 --> 00:02:51,840
Hi, everyone. I'm Matt. And I work with an ERP system called Plex.

29
00:02:51,840 --> 00:02:58,800
And I write reports using IntelliPlex and SQL development.

30
00:02:58,800 --> 00:03:14,240
So I'm really good at SQL, and I'm interested in this group to access the APIs, to write, I guess, RESTful services to access the data

31
00:03:14,240 --> 00:03:19,600
and to manipulate it like in Tableau or something else.

32
00:03:19,600 --> 00:03:22,440
That's it. Thank you. Awesome. Awesome.

33
00:03:22,440 --> 00:03:26,280
You're in the right place. We've touched on APIs.

34
00:03:26,280 --> 00:03:30,920
So I put up a couple of videos on YouTube where we go through some API work.

35
00:03:30,920 --> 00:03:41,480
But Canlytics has put together an API primarily for labs to use, but anyone can use it to store and access their lab results.

36
00:03:41,480 --> 00:03:48,640
And then Canlytics also has a module to interface with Metrc and Leaf Data Systems,

37
00:03:48,640 --> 00:03:57,360
both of which have an API that serves the traceability systems in their respective states.

38
00:03:57,360 --> 00:04:02,560
So you're in the right place. OK, good to hear. Thank you.

39
00:04:02,560 --> 00:04:07,320
Awesome. Kelly, were you interested in introducing yourself real quick?

40
00:04:07,320 --> 00:04:11,240
Yeah, I don't think I'm in the right place, actually.

41
00:04:11,240 --> 00:04:18,160
I'm Kelly and I'm using CBD right now.

42
00:04:18,160 --> 00:04:23,520
So I was curious. But you guys probably are going to roll right over my head.

43
00:04:23,520 --> 00:04:31,280
Well, you may want to stick around and listen, because of the things I put together today. So first,

44
00:04:31,280 --> 00:04:39,360
we may look at some work Charles has done just predicting when a sample would fail quality assurance testing.

45
00:04:39,360 --> 00:04:44,040
So that affects you. That's when products make it to the shelf.

46
00:04:44,040 --> 00:04:50,360
And then the second thing is we're going to be looking at hemp testing data.

47
00:04:50,360 --> 00:04:57,800
So people who are growing hemp, they need to get it tested to make sure that it doesn't have high levels of THC,

48
00:04:57,800 --> 00:05:05,360
but that it predominantly has CBD. So they need to get that tested before it makes it to the shelf for you.

49
00:05:05,360 --> 00:05:13,320
So we're sort of doing analysis upstream from your products to make sure that what you get

50
00:05:13,320 --> 00:05:17,560
is what it says it is on the label. That's the idea.

51
00:05:17,560 --> 00:05:21,520
Yeah. OK.

52
00:05:21,520 --> 00:05:27,520
And then, Heather, are you interested in introducing yourself or?

53
00:05:27,520 --> 00:05:33,680
Yes, sure. I'm Heather. I've been joining Cannabis Data Science for a couple of weeks now.

54
00:05:33,680 --> 00:05:37,880
I'm going to confess to you that I don't have data to share and no code to share.

55
00:05:37,880 --> 00:05:45,160
I am currently looking for a job. My goal is at some point soon to be working in a cannabis lab, testing something.

56
00:05:45,160 --> 00:05:53,520
I'm trying very hard to bring my QC experience to light in the industry.

57
00:05:53,520 --> 00:06:04,040
So anyway, I'm observing a lot and I have a lot of questions, but I'll reserve those after the presentation.

58
00:06:04,040 --> 00:06:09,200
So thanks. Well, awesome, Heather. I may have to forward you some job openings,

59
00:06:09,200 --> 00:06:13,480
because there are lots of laboratories that are looking for people right now.

60
00:06:13,480 --> 00:06:16,360
So thank you. You're in the right place.

61
00:06:16,360 --> 00:06:21,080
And you're welcome to learn from us.

62
00:06:21,080 --> 00:06:28,360
Use the data, and use any snippets of code that you may find useful.

63
00:06:28,360 --> 00:06:36,000
I question data, too. So you're talking about the quality assurance and the levels of anything that's tested.

64
00:06:36,000 --> 00:06:39,800
I'm finding that at least with some of the products. So I live in Maryland, by the way.

65
00:06:39,800 --> 00:06:46,960
So for some of the products that I've been using over the last year, their test results, the terpene levels, have changed.

66
00:06:46,960 --> 00:06:51,040
Yet it's hitting the same. And there's a lot of discussion on Reddit,

67
00:06:51,040 --> 00:06:55,320
people complaining about the terp levels not reaching above three percent.

68
00:06:55,320 --> 00:07:02,840
That's a very local issue. But anyway, testing is an issue for somebody that makes homemade oils and everything.

69
00:07:02,840 --> 00:07:09,400
We're interested in what we're putting out, even to ourselves, to our local market,

70
00:07:09,400 --> 00:07:13,120
if you want to think about it that way. So numbers do interest me.

71
00:07:13,120 --> 00:07:19,320
And why they change without an apparent physical change is something of concern.

72
00:07:19,320 --> 00:07:26,480
So I like to question data when I can, but that's not just for entertainment value.

73
00:07:26,480 --> 00:07:31,040
It's for the sake of the product, the quality and what we're using it for.

74
00:07:31,040 --> 00:07:34,920
So, you know, therapy, pain management, anything that you need.

75
00:07:34,920 --> 00:07:40,360
We do want to know what's in it and and why. So thank you.

76
00:07:40,360 --> 00:07:43,640
You've got an inquisitive mind. So you're in the right place.

77
00:07:43,640 --> 00:07:46,120
Thank you. I feel welcome. Thank you.

78
00:07:46,120 --> 00:07:52,240
Awesome. Welcome, David. Real quick, we were just doing a round of introductions.

79
00:07:52,240 --> 00:08:01,280
You actually happen to be the last. If you're interested, would you mind introducing yourself real quick? Or we can kick it off, either way.

80
00:08:01,280 --> 00:08:07,760
Yeah, no problem. Sorry, I'm actually at the gym, but I wanted to definitely not miss this.

81
00:08:07,760 --> 00:08:15,000
Yeah. So I recently got into data science, have a background in mainly business, project management.

82
00:08:15,000 --> 00:08:28,000
And I also have an upcoming project that I want to do a capstone on, just finding different data sets and then showing the skills that we've recently gone over, which are Python, Tableau, and SQL.

83
00:08:28,000 --> 00:08:32,960
And so this is very appealing to me. So I wanted to hear a little bit more about it.

84
00:08:32,960 --> 00:08:38,200
I've also been interested in part-time investing in the stock market.

85
00:08:38,200 --> 00:08:51,440
So I figured it would be a great way to get a little bit more solid foundation, to just become more educated in the space.

86
00:08:51,440 --> 00:08:56,560
Awesome. Well, welcome, David. And so welcome, everybody.

87
00:08:56,560 --> 00:09:05,160
And so let's just go ahead and kick it off. For starters, I think I can present, unless you want to present, Charles.

88
00:09:05,160 --> 00:09:15,840
But I think if you're open, we can start showing some of the work that Charles has done predicting the probability that a sample will fail quality assurance testing.

89
00:09:15,840 --> 00:09:23,200
So when the sample is sent in, depending on its attributes, what's the chance that it's going to fail?

90
00:09:23,200 --> 00:09:31,520
So can you start to predict that ahead of time, essentially?

91
00:09:31,520 --> 00:09:33,720
Do you want to present Charles or should I?

92
00:09:33,720 --> 00:09:51,120
You can present. OK, awesome.

93
00:09:51,120 --> 00:10:01,960
OK, so I looked at, like, how could you tell, using the Washington data, if a sample would fail?

94
00:10:01,960 --> 00:10:17,040
And so in order to do this, you have data that's from the past, like the type and the date it was tested, or, well,

95
00:10:17,040 --> 00:10:21,400
the type and the lab ID and the producer ID.

96
00:10:21,400 --> 00:10:24,160
You also have data from the future, from after it was tested.

97
00:10:24,160 --> 00:10:30,200
So you can't use any of that because you can't peek into the future to make a prediction.

98
00:10:30,200 --> 00:10:41,080
So I just read those things in and tried to use those.

99
00:10:41,080 --> 00:10:48,040
So, yeah, I mean, there's a lot of data there from the future.

100
00:10:48,040 --> 00:11:02,360
But so, yeah, I read that stuff in, and then you can keep scrolling down.

101
00:11:02,360 --> 00:11:10,800
Yeah, and I went through and checked and got rid of rows that were missing data, or tried to fill in the data.

102
00:11:10,800 --> 00:11:17,040
And so the one thing that you have in that data is the product type, which you know ahead of time.

103
00:11:17,040 --> 00:11:23,440
And I looked at each of the product types, and there's actually less than one percent failure for each one.

104
00:11:23,440 --> 00:11:26,160
I mean, a really, really small value.

105
00:11:26,160 --> 00:11:33,560
So this is going to be kind of difficult to find.

106
00:11:33,560 --> 00:11:40,440
And then also, overall, it's around a one percent failure rate.

107
00:11:40,440 --> 00:11:49,400
So you have a lot of passing values, but not a lot of failures.

108
00:11:49,400 --> 00:11:52,280
Yeah. So that makes it difficult.

109
00:11:52,280 --> 00:11:56,440
And I went through so there's a type and there's an intermediate type.

110
00:11:56,440 --> 00:12:02,760
And I tried to go through intermediate types to see if that would be useful.

111
00:12:02,760 --> 00:12:16,040
But there are a lot of missing values, and there wasn't a straightforward way to fill in the values.

112
00:12:16,040 --> 00:12:19,160
And for some of the types, there was only one value.

113
00:12:19,160 --> 00:12:24,920
So it would just be a one-to-one correlation, and it wouldn't really add information.

114
00:12:24,920 --> 00:12:35,760
So I ended up dropping the intermediate types.

115
00:12:35,760 --> 00:12:45,640
Yes. And so it looks like the ones you want to focus on here are, like...

116
00:12:45,640 --> 00:12:50,400
I think the flower lots are what's getting tested.

117
00:12:50,400 --> 00:12:58,440
The flower may actually just be like wet flower that may be getting transferred for processing.

118
00:12:58,440 --> 00:13:06,240
I need to double-check on that. And then, of course, you want the breakdown of the concentrates.

119
00:13:06,240 --> 00:13:16,960
But it may not be worthwhile looking at some things like capsules and whatnot; it depends on the data.

120
00:13:16,960 --> 00:13:23,280
But it seems our biggest challenge here is just the prevalence of zeros.

121
00:13:23,280 --> 00:13:28,200
Right. So the prevalence of samples that pass makes it hard to predict

122
00:13:28,200 --> 00:13:33,040
The black swan. And these numbers are percentages.

123
00:13:33,040 --> 00:13:38,280
So they're really, you know, really small.

124
00:13:38,280 --> 00:13:43,080
Or no, I guess it's normalized. But yeah.

125
00:13:43,080 --> 00:13:46,200
Are these failure rates right here, or... no?

126
00:13:46,200 --> 00:13:53,520
Those are the numbers of samples among the total sample.

127
00:13:53,520 --> 00:13:57,640
So is this saying that 50 percent of the samples are flower lots?

128
00:13:57,640 --> 00:14:09,760
Yeah. But 84 percent of the rows had no information.

129
00:14:09,760 --> 00:14:14,760
There was no intermediate type listed.

130
00:14:14,760 --> 00:14:18,560
So it was kind of, you know, how do you fill that in?

131
00:14:18,560 --> 00:14:25,400
And I couldn't come up with a good way to fill in that missing data.

132
00:14:25,400 --> 00:14:31,080
OK, that's a large percentage.

133
00:14:31,080 --> 00:14:44,600
So did those have lab results, that 84 percent? Yes.

134
00:14:44,600 --> 00:14:48,440
Did they have a product type?

135
00:14:48,440 --> 00:14:52,880
Like, were they just all end products, I'm thinking?

136
00:14:52,880 --> 00:14:59,080
Yeah, there was some sort of there was a product type there, but there was no intermediate type.

137
00:14:59,080 --> 00:15:03,520
Well, there's a chance they just weren't end products, but...

138
00:15:03,520 --> 00:15:06,880
But it could just be bad.

139
00:15:06,880 --> 00:15:13,880
It could be missing data. So.

140
00:15:13,880 --> 00:15:22,080
Right. And there was just no way to figure out a good way to fill it in

141
00:15:22,080 --> 00:15:25,520
without sort of corrupting it more.

142
00:15:25,520 --> 00:15:28,200
So I just didn't use the intermediate type.

143
00:15:28,200 --> 00:15:32,480
I think the intermediate type would have really been helpful.

144
00:15:32,480 --> 00:15:42,440
But there wasn't a clear path forward on how to fill in that missing data.

145
00:15:42,440 --> 00:15:48,720
So instead, did you go with...

146
00:15:48,720 --> 00:15:59,080
So I just used the product type, because there was a product type filled in for almost all the rows.

147
00:15:59,080 --> 00:16:04,880
And so I didn't have to drop that much data for that.

148
00:16:04,880 --> 00:16:09,000
Right. And so for, like, marijuana, flower was the only intermediate type.

149
00:16:09,000 --> 00:16:16,920
So if you had marijuana and flower, right, the flower didn't add any more information to the marijuana.

150
00:16:16,920 --> 00:16:20,960
Right. It was just it was one to one.

151
00:16:20,960 --> 00:16:28,080
So it was just going to be an extra feature to use.

152
00:16:28,080 --> 00:16:35,640
And it just would have made it more complicated for the learner because it had this extra feature, but the feature didn't add anything.

153
00:16:35,640 --> 00:16:40,440
Yes. So I think we may just need to narrow our analysis down.

154
00:16:40,440 --> 00:16:45,600
And that may be what we get to today with the hemp analysis. So...

155
00:16:45,600 --> 00:16:52,160
It may be worthwhile to just look at the failure rate for flower, right? Because...

156
00:16:52,160 --> 00:16:55,440
Well, the other ones matter as well.

157
00:16:55,440 --> 00:17:00,920
But ultimately, you need clean flower before you process it.

158
00:17:00,920 --> 00:17:08,000
Of course, I think you can just keep distilling out the contaminants.

159
00:17:08,000 --> 00:17:17,120
However, we just may need to narrow our analysis

160
00:17:17,120 --> 00:17:23,760
to get meaningful conclusions, I think.

161
00:17:23,760 --> 00:17:28,080
But keep watching me do this. So, OK.

162
00:17:28,080 --> 00:17:35,240
When you do a classifier, you start out with a dummy classifier.

163
00:17:35,240 --> 00:17:40,600
And the dummy classifier doesn't care about what the inputs are, it just outputs something.

164
00:17:40,600 --> 00:17:52,000
So I came up with a dummy classifier, and then, because there are only a limited number of product types, I one-hot encoded them,

165
00:17:52,000 --> 00:17:58,080
because you have to use a numeric value; you can't use the text value.

166
00:17:58,080 --> 00:18:11,960
So what that does is it takes the value and adds a column for each type, and then puts either a zero or a one in that column, depending on if it's that type.

167
00:18:11,960 --> 00:18:19,960
So you can see, like the first one, it's harvest material. So there's a one and all the other values are zero.

168
00:18:19,960 --> 00:18:24,760
And the next one's intermediate type.

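The one-hot encoding described above can be sketched in a few lines of Python; the column values here are illustrative stand-ins, not the actual Washington schema:

```python
import pandas as pd

# Hypothetical product types; stand-ins for the Washington dataset's values.
df = pd.DataFrame({"product_type": ["harvest_material", "flower_lot", "concentrate"]})

# One column per type, holding 1 if the row is that type, else 0.
encoded = pd.get_dummies(df, columns=["product_type"])
print(encoded.columns.tolist())
```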
169
00:18:24,760 --> 00:18:37,000
And so this is a way for the classifier to be able to use this categorical data.

170
00:18:37,000 --> 00:18:43,280
And then I label-encoded the status because, again, the status was pass and fail,

171
00:18:43,280 --> 00:18:53,000
but it has to be zero and one in order for the classifier to understand it.

172
00:18:53,000 --> 00:18:58,120
And then you split the data into a training set and a testing set.

173
00:18:58,120 --> 00:19:11,880
So I held out 20 percent of the data for testing and then I trained a dummy classifier that output the most frequent value, which is pass.

174
00:19:11,880 --> 00:19:21,480
So no matter what you give it, it outputs pass.

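The steps just described — label-encoding pass/fail, holding out 20 percent, and fitting a most-frequent dummy classifier — can be sketched with scikit-learn on synthetic, pass-heavy labels (the real data isn't shown here):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic, heavily imbalanced labels: roughly 99% pass, 1% fail.
rng = np.random.default_rng(0)
y_text = np.where(rng.random(10_000) < 0.99, "pass", "fail")
X = rng.integers(0, 5, size=(10_000, 1))  # stand-in for encoded product types

# Pass/fail must be numeric: label-encode to 0 (fail) and 1 (pass).
y = LabelEncoder().fit_transform(y_text)

# Hold out 20 percent of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline: always output the most frequent class, i.e. pass.
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(dummy.score(X_test, y_test))  # high accuracy, yet it never flags a failure
```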
175
00:19:21,480 --> 00:19:31,600
So then when you look at it, you have to figure out what is a good metric to use to judge whether your classifier is good or not.

176
00:19:31,600 --> 00:19:41,920
So if you look at the recall, it's 99.25 percent. I mean, that looks like a great classifier.

177
00:19:41,920 --> 00:19:48,120
But in the confusion matrix down here, you can see it got all the passes right,

178
00:19:48,120 --> 00:19:51,680
but it got all the failures wrong. So that's not really good.

179
00:19:51,680 --> 00:19:58,840
So some of these, macro recall and balanced accuracy, are 50 percent, which is really what it is.

180
00:19:58,840 --> 00:20:02,600
Right. It got 50 percent of what you're trying to determine correct.

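That contrast — near-perfect raw recall but 50 percent balanced accuracy — shows up for any always-pass prediction; a small synthetic check:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, recall_score

# Synthetic truth: 1 = pass, 0 = fail; the "classifier" predicts pass for everything.
y_true = np.array([1] * 990 + [0] * 10)
y_pred = np.ones_like(y_true)

print(recall_score(y_true, y_pred, pos_label=1))  # recall on "pass" is perfect
print(confusion_matrix(y_true, y_pred))           # but every failure is missed
print(balanced_accuracy_score(y_true, y_pred))    # (1.0 + 0.0) / 2 = 0.5
```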
181
00:20:02,600 --> 00:20:08,040
So basically, the training data is almost all passes.

182
00:20:08,040 --> 00:20:13,320
And so the algorithm just predicts pass every time, essentially.

183
00:20:13,320 --> 00:20:17,920
Well, that's what it's supposed to do. Right.

184
00:20:17,920 --> 00:20:23,040
Yeah, this is just a baseline.

185
00:20:23,040 --> 00:20:30,640
Like, so you should be able to train a classifier that outperforms the dummy classifier.

186
00:20:30,640 --> 00:20:38,200
Right. The dummy classifier only outputs pass, or this particular version of it does.

187
00:20:38,200 --> 00:20:52,080
And so this is just to see if your real classifier actually works.

188
00:20:52,080 --> 00:21:01,760
And so I did a second dummy classifier whose output distribution is stratified based on, like...

189
00:21:01,760 --> 00:21:11,680
It does a percentage based on what the actual training data outputs.

190
00:21:11,680 --> 00:21:19,240
So it didn't really do much better.

191
00:21:19,240 --> 00:21:23,560
Right. Again, it comes up with a 98 percent recall.

192
00:21:23,560 --> 00:21:33,360
It's just a little bit lower, but it still does really poorly.

193
00:21:33,360 --> 00:21:37,160
And so is there sort of a problem where

194
00:21:37,160 --> 00:21:44,560
it's going to start predicting... Well, and this is actually one of the things that you need to keep in mind.

195
00:21:44,560 --> 00:21:48,280
So one of the rules of forecasting is know the forecasting error.

196
00:21:48,280 --> 00:21:57,240
So what's the bigger problem: predicting a pass when it's actually a fail, or predicting a fail when it's actually a pass?

197
00:21:57,240 --> 00:22:00,600
So what's the cost of being wrong?

198
00:22:00,600 --> 00:22:03,080
So. Right.

199
00:22:03,080 --> 00:22:05,200
But again, this is a baseline.

200
00:22:05,200 --> 00:22:09,280
This is what we're trying to outperform.

201
00:22:09,280 --> 00:22:14,600
But, yeah, again, it's about 50 percent right.

202
00:22:14,600 --> 00:22:28,960
Because this is an imbalanced data set, you can calculate these weights to help influence the output.

203
00:22:28,960 --> 00:22:40,680
And so it'll boost the weights that predict the failures above the ones that predict passing.

204
00:22:40,680 --> 00:22:56,960
So I calculated those, and then I used a logistic regression to do the analysis again, only using the product type.

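The weighting plus logistic regression step can be sketched with scikit-learn's class_weight="balanced", which computes those class weights automatically; the data and per-type failure rates below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic data: one categorical feature (product type) with made-up
# failure probabilities per type, then one-hot encoded.
rng = np.random.default_rng(1)
n = 20_000
product = rng.integers(0, 4, n)
p_fail = np.array([0.002, 0.005, 0.02, 0.04])[product]
y = (rng.random(n) < p_fail).astype(int)  # 1 = fail, only a percent or two overall
X = np.eye(4)[product]                    # one-hot product type

# class_weight="balanced" boosts the rare failing class during fitting.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print(confusion_matrix(y, clf.predict(X)))
```

With the weights applied, the model starts flagging the failure-prone product types as fails, at the cost of mislabeling many true passes — the same trade-off discussed here.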
205
00:22:56,960 --> 00:23:01,920
And you can see that it actually does much better at least at predicting failures.

206
00:23:01,920 --> 00:23:04,400
It gets 96 percent of them right.

207
00:23:04,400 --> 00:23:20,480
But then the passing ones drop down to 76 percent, and you get 24 percent of the values that should have passed predicted as failures.

208
00:23:20,480 --> 00:23:31,480
OK, so with this model, it's only getting, like...

209
00:23:31,480 --> 00:23:32,760
Hold on, walk me through this real quick.

210
00:23:32,760 --> 00:23:39,440
So it's calling a pass a pass correctly 75 percent of the time.

211
00:23:39,440 --> 00:23:42,600
So that's a true pass.

212
00:23:42,600 --> 00:23:48,800
Yes. And it's getting, actually, 96 percent of the failures

213
00:23:48,800 --> 00:23:53,560
Correct. Wow.

214
00:23:53,560 --> 00:23:57,920
So on those, it's accurate.

215
00:23:57,920 --> 00:24:01,440
So that's saying a fail when it is a true fail.

216
00:24:01,440 --> 00:24:04,640
Right. And so this balanced accuracy, right.

217
00:24:04,640 --> 00:24:08,520
It's about 86 percent.

218
00:24:08,520 --> 00:24:18,400
Right, because that takes the average of the correct passes and the correct failures.

219
00:24:18,400 --> 00:24:22,240
And, you know, and that's about where it should be.

220
00:24:22,240 --> 00:24:31,520
If you take those two values, add them, and divide by two, it's about 86 percent correct.

221
00:24:31,520 --> 00:24:44,320
And then what's the percentage here where it's labeling it a pass when it's actually a fail?

222
00:24:44,320 --> 00:24:47,640
That's that sort of dark purplish one.

223
00:24:47,640 --> 00:24:53,320
OK, that's what I thought. Yeah. And so that's.

224
00:24:53,320 --> 00:24:58,400
Is that four percent or? Yeah.

225
00:24:58,400 --> 00:25:08,320
OK, so that's the square you're really worried about, where the algorithm predicts a pass when it's actually a fail,

226
00:25:08,320 --> 00:25:19,320
because if the algorithm predicts a fail and it actually passes, it's not...

227
00:25:19,320 --> 00:25:25,160
Basically, the idea is you want to test the failures as quickly as possible.

228
00:25:25,160 --> 00:25:32,920
That way you can do all the quality assurance tests that the laboratory does to make sure that it is, in fact, a fail.

229
00:25:32,920 --> 00:25:41,000
So predicting that it's going to be a fail when it's actually a pass may increase their costs a little bit.

230
00:25:41,000 --> 00:25:46,600
But what you really want to avoid is saying it's a pass when it's actually a fail.

231
00:25:46,600 --> 00:25:53,160
So to me, this looks like an outstanding fit.

232
00:25:53,160 --> 00:25:57,680
Is that the way it looks to you?

233
00:25:57,680 --> 00:26:07,760
It does on the surface. And if you go down, I'm actually able to improve on this.

234
00:26:07,760 --> 00:26:15,880
So I took in the lab and the producer

235
00:26:15,880 --> 00:26:21,640
and added those as features. Interesting.

236
00:26:21,640 --> 00:26:27,840
Because, again, those are things that you know ahead of time. Right.

237
00:26:27,840 --> 00:26:29,920
There aren't a lot of things that you know ahead of time.

238
00:26:29,920 --> 00:26:35,920
So then I used the CatBoost classifier, which is a better quality classifier.

239
00:26:35,920 --> 00:26:41,560
But at first I used only the product type.

240
00:26:41,560 --> 00:26:47,320
OK, so can I ask, what are all the factors we're using in prediction at this moment?

241
00:26:47,320 --> 00:26:51,840
So are we just using these three variables to predict failure?

242
00:26:51,840 --> 00:26:57,200
Up to this point, we've only been using the product type.

243
00:26:57,200 --> 00:27:09,240
Which I find really weird, that that is somehow that accurate, but...

244
00:27:09,240 --> 00:27:12,640
Because it seems like it would just be kind of.

245
00:27:12,640 --> 00:27:22,440
It seems like it would be random. Well, we'll have to test it in practice, I think.

246
00:27:22,440 --> 00:27:30,720
So what happens when you take into consideration lab plus producer?

247
00:27:30,720 --> 00:27:39,160
OK, so first I used the CatBoost classifier with just the product type, so we could compare it to the previous classifier.

248
00:27:39,160 --> 00:27:47,880
So if you scroll down some more... Let's see.

249
00:27:47,880 --> 00:27:53,680
So, yeah, it's slightly...

250
00:27:53,680 --> 00:27:56,400
It's actually about the same. It's very close to the same.

251
00:27:56,400 --> 00:28:00,680
I think it gets more of the passes correct.

252
00:28:00,680 --> 00:28:05,480
And then when I used the lab and the producer...

253
00:28:05,480 --> 00:28:10,560
Or the product type, the lab, and the producer, I got this.

254
00:28:10,560 --> 00:28:15,440
I got 99% of the failures correct.

255
00:28:15,440 --> 00:28:23,720
79% of the passes correct. It's down to, what, 0.005

256
00:28:23,720 --> 00:28:32,440
for incorrectly predicting a failure as a pass.

257
00:28:32,440 --> 00:28:39,520
So this seems really good.

258
00:28:39,520 --> 00:28:47,920
But if you scroll down a little bit...

259
00:28:47,920 --> 00:28:52,600
These are the actual numbers.

260
00:28:52,600 --> 00:28:59,880
It gets almost all the failures correct. It only predicts 17

261
00:28:59,880 --> 00:29:07,960
failures as passing, but it predicts 83,550

262
00:29:07,960 --> 00:29:20,360
passing values as failing. Which is huge.

263
00:29:20,360 --> 00:29:28,080
So I just had a question to help orient myself. The total amount of failures is

264
00:29:28,080 --> 00:29:31,440
the 2,951 and the 17, right?

265
00:29:31,440 --> 00:29:42,960
Like those are the actual true failures. Yes. OK, cool. Thank you.

266
00:29:42,960 --> 00:29:48,960
So I guess the amount of.

267
00:29:48,960 --> 00:29:52,360
I guess what do you call it?

268
00:29:52,360 --> 00:29:57,160
It is false... false positives, I guess that would be,

269
00:29:57,160 --> 00:30:00,760
depending on if you want to call a failure a positive.

270
00:30:00,760 --> 00:30:03,280
So, yeah, I think in this case we are.

271
00:30:03,280 --> 00:30:09,680
So I guess if you just put all of those ones that you predict as failure,

272
00:30:09,680 --> 00:30:14,680
but they're actually passes on the assembly line, it may still.

273
00:30:14,680 --> 00:30:17,840
I think it would still increase efficiencies, but you're right.

274
00:30:17,840 --> 00:30:31,840
It would still kind of... You know, basically, the lower you can decrease that,

275
00:30:31,840 --> 00:30:35,720
the more efficient you are. And then, you know, of course, the lower you are,

276
00:30:35,720 --> 00:30:39,960
you have this number in this third quadrant, sort of the safer you are.

277
00:30:39,960 --> 00:30:44,680
So that's the way I look at it. If you lower this number, you're going to be safer.

278
00:30:44,680 --> 00:30:48,600
And if you lower this number, you're going to be more efficient.

279
00:30:48,600 --> 00:30:53,440
And of course, you want to maximize these two quadrants.

280
00:30:53,440 --> 00:30:58,720
where you're predicting fail as fail and pass as pass.

281
00:30:58,720 --> 00:31:03,320
Right. Yeah, I mean, there's a limited amount of information there,

282
00:31:03,320 --> 00:31:09,960
and it seems really odd to me that that combination of features

283
00:31:09,960 --> 00:31:14,760
is able to predict this accurately. But, you know...

284
00:31:14,760 --> 00:31:22,240
That's what it seems to be doing. Now, what it's actually doing underneath, I don't know.

285
00:31:22,240 --> 00:31:27,400
And the really interesting thing about this was that you can

286
00:31:27,400 --> 00:31:34,640
you can ask the model what features were the most important in it, making its decision.

287
00:31:34,640 --> 00:31:41,680
And product type was, what, about 24 percent.

288
00:31:41,680 --> 00:31:49,960
The lab ID was 74 percent and the producer was like two percent.

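Asking the model which features mattered looks roughly like this: CatBoost exposes get_feature_importance(), and scikit-learn tree ensembles expose feature_importances_, used here on synthetic data where the second column (think: lab ID) drives the label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends almost entirely on column 1.
rng = np.random.default_rng(3)
X = rng.integers(0, 10, size=(2_000, 3))
y = (X[:, 1] > 7).astype(int)

clf = RandomForestClassifier(random_state=0).fit(X, y)
# Importances sum to 1; column 1 should dominate, much like the lab ID did here.
print(clf.feature_importances_)
```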
289
00:31:49,960 --> 00:31:56,120
So. Well, I think that may be a problem.

290
00:31:56,120 --> 00:32:00,560
Because the lab ID would tell it if it fails.

291
00:32:00,560 --> 00:32:04,400
So I think you need to exclude the lab ID from the analysis, right?

292
00:32:04,400 --> 00:32:09,640
Because that's why it's almost like perfect, because it's just.

293
00:32:09,640 --> 00:32:13,800
Certain lab IDs just should always fail, right?

294
00:32:13,800 --> 00:32:15,960
No, the laboratory doing the testing.

295
00:32:15,960 --> 00:32:25,280
Oh, I see what you're saying, like the laboratory license ID. Yeah.

296
00:32:25,280 --> 00:32:30,760
So, I mean, and this sort of brings up the point, you know, do certain labs fail,

297
00:32:30,760 --> 00:32:36,560
you know, are they more stringent in their methodologies and they fail more product

298
00:32:36,560 --> 00:32:44,880
or do they just tend to get product, you know, that fails more often?

299
00:32:44,880 --> 00:32:50,800
I mean, it's kind of and it's interesting that it's actually not.

300
00:32:50,800 --> 00:32:58,520
You know, it's not really closely correlated to the producer.

301
00:32:58,520 --> 00:33:03,000
So, I don't know, this brings up a lot of questions and it's this is kind of as good as I could get.

302
00:33:03,000 --> 00:33:09,560
I actually ran the last classifier through Optuna, which is a.

303
00:33:09,560 --> 00:33:14,480
This is like a genetic algorithm and it goes through and it keeps running tests.

304
00:33:14,480 --> 00:33:19,760
It keeps trying the classifier with different parameters to optimize it.

305
00:33:19,760 --> 00:33:32,960
It ran for like about 12 hours and I got that tiny little bit of improvements.

306
00:33:32,960 --> 00:33:37,000
So I think this is fantastic.

307
00:33:37,000 --> 00:33:42,800
And so I'm curious.

308
00:33:42,800 --> 00:33:52,480
It makes me curious because it seems to me like an exceptional fit with very limited predicting factors.

309
00:33:52,480 --> 00:34:04,680
So you could potentially try to cut your teeth on the same

310
00:34:04,680 --> 00:34:17,400
same models. So, if you're OK with slightly changing gears, we're actually going to keep this whole analysis in mind here for a second.

311
00:34:17,400 --> 00:34:22,600
Because basically what I want to show you is.

312
00:34:22,600 --> 00:34:27,320
There's this hemp database, so still cannabis.

313
00:34:27,320 --> 00:34:35,160
However, hemp, you know, they're actually shooting for low THC levels.

314
00:34:35,160 --> 00:34:39,640
And so there's an interesting database here.

315
00:34:39,640 --> 00:34:49,560
Put out by the University of Illinois, where they basically have cultivators submit

316
00:34:49,560 --> 00:34:53,640
Some information about the strains they're growing.

317
00:34:53,640 --> 00:35:02,400
So essentially how they planted it, when they harvested it.

318
00:35:02,400 --> 00:35:07,880
And then they are getting it tested.

319
00:35:07,880 --> 00:35:21,720
And, you know, several of these laboratories, ACT Laboratories, Pride, Rock River and perhaps a couple more here.

320
00:35:21,720 --> 00:35:30,280
And so. We actually have nice data on.

321
00:35:30,280 --> 00:35:41,400
So here we have essentially the strain, the cultivar.

322
00:35:41,400 --> 00:35:48,720
The source, I'm not 100 percent certain if this is the producer or perhaps the seed company.

323
00:35:48,720 --> 00:35:54,560
I've got a feeling it's the producer. You also have the state.

324
00:35:54,560 --> 00:35:59,400
So this is for the Midwest. So you primarily have.

325
00:35:59,400 --> 00:36:05,280
Illinois, Wisconsin, Indiana and Michigan, I believe.

326
00:36:05,280 --> 00:36:15,640
And so then you also have the county. And then what I think is going to be an interesting predicting factor is the sample date.

327
00:36:15,640 --> 00:36:22,600
So here they're measuring total THC and total CBD.

328
00:36:22,600 --> 00:36:28,680
Well, they're doing CBG and CBD, it looks like.

329
00:36:28,680 --> 00:36:38,760
So hemp producers, you know, they want a high level of CBD rate like this 13 percent CBD.

330
00:36:38,760 --> 00:36:44,600
That's outstanding. And so that's what producers are trying to produce.

331
00:36:44,600 --> 00:36:51,800
Right. Because they. For example, Kelly.

332
00:36:51,800 --> 00:36:57,080
They need to process this into hemp for people to buy on the shelves.

333
00:36:57,080 --> 00:37:06,360
So the higher percentage the plant is, the more efficient your extraction is going to be.

334
00:37:06,360 --> 00:37:16,640
However, watch out, because look, the average unit, this actually has a zero point four percent THC.

335
00:37:16,640 --> 00:37:23,200
And so they would actually have to destroy this lot of hemp.

336
00:37:23,200 --> 00:37:34,200
So even though that lot of hemp has 13 percent CBD, it breaks the federal

337
00:37:34,200 --> 00:37:39,280
limit for hemp, which is zero point three percent.

338
00:37:39,280 --> 00:37:43,680
And so that would actually be a failure rate for hemp.

339
00:37:43,680 --> 00:37:51,000
And so what you see is the failure rate for hemp.

340
00:37:51,000 --> 00:38:04,160
You know, that's going to be a lot higher than the failure rate we observed with quality assurance of recreational cannabis in Washington.

341
00:38:04,160 --> 00:38:16,560
So, Charles, I was thinking your model may be useful for using some of these factors, such as cultivar and source.

342
00:38:16,560 --> 00:38:22,000
I was going to perhaps look at state today with you to see if.

343
00:38:22,000 --> 00:38:29,120
What state the cannabis is produced in may have an effect on its failure rate.

344
00:38:29,120 --> 00:38:33,720
And basically, we want to see, OK, what's the chance of it?

345
00:38:33,720 --> 00:38:41,360
The cannabis, the hemp being less than zero point three percent.

346
00:38:41,360 --> 00:38:50,200
So just to go ahead and show you this data, I've gone ahead and scraped it into.

347
00:38:50,200 --> 00:38:55,200
Just an Excel sheet.

348
00:38:55,200 --> 00:38:56,680
You're welcome to as well.

349
00:38:56,680 --> 00:39:03,400
Basically, I just copied and pasted this 100 observations at a time.

350
00:39:03,400 --> 00:39:10,120
If you think of a better way to extract this data, then definitely let me know.

351
00:39:10,120 --> 00:39:15,080
But just sort of brute force collected it.

352
00:39:15,080 --> 00:39:21,160
And so this will actually be the first time that I'll have done analysis on this.

353
00:39:21,160 --> 00:39:25,040
So we can actually do it live.

354
00:39:25,040 --> 00:39:28,960
And essentially, we may need to conclude next week.

355
00:39:28,960 --> 00:39:39,600
But basically, the idea is we want to first calculate whether the sample passed or failed

356
00:39:39,600 --> 00:39:43,160
based on the total THC levels.

357
00:39:43,160 --> 00:39:50,720
And then we essentially want to try to predict, say, the failure rate given some of the factors.

358
00:39:50,720 --> 00:39:56,160
So for today, I was going to start looking at state.

359
00:39:56,160 --> 00:40:01,440
So we're just going to do this live real quick.

360
00:40:01,440 --> 00:40:15,040
So well, I suppose I can run it here in VS code.

361
00:40:15,040 --> 00:40:22,760
And so first things first, let's just read in the data.

362
00:40:22,760 --> 00:40:26,880
And just start doing a little exploration here with some of the time we have.

363
00:40:26,880 --> 00:40:35,280
But any thoughts so far, Charles, if this is data that you may be able to work with?

364
00:40:35,280 --> 00:40:43,480
Yeah, actually, there's actually more features that are known in advance with this.

365
00:40:43,480 --> 00:40:47,040
So I can definitely try this.

366
00:40:47,040 --> 00:40:58,040
OK, so it looks like we have our.

367
00:40:58,040 --> 00:41:09,440
So I may move over to Spyder here in a second, but we'll start with VS Code.

368
00:41:09,440 --> 00:41:14,600
OK, so we've got our variables.

369
00:41:14,600 --> 00:41:21,000
We'll actually here first, we actually want to read in the correct sheet.
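Reading a specific sheet out of the workbook with pandas looks like this; the file and sheet names below are placeholders, not the actual workbook's, and a tiny workbook is written first so the example is self-contained:

```python
# Sketch of reading the correct sheet from an Excel workbook with pandas.
# File and sheet names are placeholders; a tiny workbook is written first
# so the example is reproducible (requires the openpyxl engine).
import pandas as pd

sample = pd.DataFrame({"Total THC": [0.21, 0.35], "Total CBD": [4.8, 9.1]})
sample.to_excel("hemp_demo.xlsx", sheet_name="tests", index=False)

# sheet_name selects which sheet to load; the default is the first one.
df = pd.read_excel("hemp_demo.xlsx", sheet_name="tests")
print(df.shape)
```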

370
00:41:21,000 --> 00:41:31,080
OK, so.

371
00:41:31,080 --> 00:41:38,640
So I went ahead and collected all of the data points here,

372
00:41:38,640 --> 00:41:39,640
really.

373
00:41:39,640 --> 00:41:46,200
So you may want to take a look at this, Charles and everyone else as well.

374
00:41:46,200 --> 00:41:51,240
But I didn't see quite a one-to-one relationship here.

375
00:41:51,240 --> 00:41:56,560
So basically, you see there's two hundred and fifty three.

376
00:41:56,560 --> 00:42:00,360
What they call cultivar entries.

377
00:42:00,360 --> 00:42:07,040
And so this says the harvest date.

378
00:42:07,040 --> 00:42:16,800
And you see, for example, there's three and then when you go and look.

379
00:42:16,800 --> 00:42:23,260
At like the summary table.

380
00:42:23,260 --> 00:42:27,960
You see there were OK, there were three abbeys.

381
00:42:27,960 --> 00:42:31,160
And then this just has their average.

382
00:42:31,160 --> 00:42:40,240
So this is trying to uncover like, OK, do specific strains have sort of expected CBD

383
00:42:40,240 --> 00:42:47,920
and THC levels. Essentially, hemp producers are trying to

384
00:42:47,920 --> 00:42:54,480
Sort of narrow it down and sort of settle on some stock strains that they can rely on,

385
00:42:54,480 --> 00:43:02,160
because as you can see, people are having a problem with failing for THC levels.

386
00:43:02,160 --> 00:43:05,620
Here is the more granular data.

387
00:43:05,620 --> 00:43:12,080
You can see the three abbeys and you can see the three different tests.

388
00:43:12,080 --> 00:43:21,160
However, there's seven hundred and fifty-three test data points.

389
00:43:21,160 --> 00:43:29,040
So I'm not certain how these map to.

390
00:43:29,040 --> 00:43:33,520
The cultivar entries.

391
00:43:33,520 --> 00:43:37,400
I've got a suspicion that perhaps.

392
00:43:37,400 --> 00:43:43,000
Say a farmer planted a field of.

393
00:43:43,000 --> 00:43:45,640
Berry blossoms.

394
00:43:45,640 --> 00:43:51,720
Or what common strain you see is this cherry wine, cherry blossom.

395
00:43:51,720 --> 00:43:57,000
So if a farmer grew a field of that, they may sample it multiple times.

396
00:43:57,000 --> 00:44:02,440
So we still need to get to the bottom of this data.

397
00:44:02,440 --> 00:44:08,540
And because it would be nice to be able to combine.

398
00:44:08,540 --> 00:44:17,160
The cannabinoid data with the harvest date.

399
00:44:17,160 --> 00:44:22,600
The next best thing to the harvest date we have is just the sample date.

400
00:44:22,600 --> 00:44:30,760
And basically, to show you some of the figures that they've put together that we'll be recreating,

401
00:44:30,760 --> 00:44:41,300
here is essentially the plot of total THC to total CBD.

402
00:44:41,300 --> 00:44:51,200
So as we mentioned earlier, the producers, they want to maximize their CBD while staying

403
00:44:51,200 --> 00:44:56,240
below the federal zero point three percent threshold.

404
00:44:56,240 --> 00:45:03,960
And so, what I noticed that's interesting, and I actually just noticed this looking

405
00:45:03,960 --> 00:45:09,520
at this this time is it looks like there is a slight positive correlation here.

406
00:45:09,520 --> 00:45:13,540
So that would make sense.

407
00:45:13,540 --> 00:45:19,920
Generally the plants that are producing higher cannabinoids, higher CBD also produce higher

408
00:45:19,920 --> 00:45:24,560
THC. And so you're sort of playing a dance here

409
00:45:24,560 --> 00:45:30,880
where you're trying to grow the highest CBD plant you can.

410
00:45:30,880 --> 00:45:37,840
To the point where you stay below the threshold. Because as you can see, I mean, it's looking

411
00:45:37,840 --> 00:45:46,560
like a non-negligible portion of the sample is failing.

412
00:45:46,560 --> 00:45:55,440
One thing I've read is that, and the data here to do this analysis is here,

413
00:45:55,440 --> 00:46:03,680
the time between the harvest and the testing matters. Apparently, after you harvest hemp,

414
00:46:03,680 --> 00:46:09,960
the THC level increases. And, like, people have

415
00:46:09,960 --> 00:46:16,960
tested their hemp and it's passed and then they're like transporting it and they get

416
00:46:16,960 --> 00:46:24,880
pulled over and they test it and they fail because an amount of time has passed where

417
00:46:24,880 --> 00:46:28,120
the THC level has risen. So there's sort of this race against the

418
00:46:28,120 --> 00:46:34,240
clock. And you know what's happening is, it's these

419
00:46:34,240 --> 00:46:42,480
CBG levels. So THC is a I'm not going to use the correct scientific

420
00:46:42,480 --> 00:46:47,920
word here. It's a derivative of CBG.

421
00:46:47,920 --> 00:46:56,000
So CBG is a precursor element, or maybe CBGA.

422
00:46:56,000 --> 00:47:00,960
I may be getting it wrong. But essentially, all of the cannabinoids

423
00:47:00,960 --> 00:47:07,920
essentially have a common precursor and depending on various factors such as you mentioned how

424
00:47:07,920 --> 00:47:18,480
it's stored determines how the chemicals you know break down and oxidize into a different

425
00:47:18,480 --> 00:47:24,320
you know how they break down and turn into different cannabinoids.

426
00:47:24,320 --> 00:47:33,320
So that's an interesting observation, Charles. So

427
00:47:33,320 --> 00:47:41,520
we may not be able to incorporate the CBG levels but

428
00:47:41,520 --> 00:47:49,760
but what we could potentially try to do is, say, I don't know how we

429
00:47:49,760 --> 00:47:56,120
would work this into our analysis, but say we could try to predict. So say you grew an

430
00:47:56,120 --> 00:48:01,280
abacus strain. Okay well given that you've grown an abacus

431
00:48:01,280 --> 00:48:15,240
strain in Wisconsin, and that, historically, abacus has an expected

432
00:48:15,240 --> 00:48:23,960
CBG level of 0.24, given its historic averages, you could almost

433
00:48:23,960 --> 00:48:29,800
try to predict the probability of a sample failing.

434
00:48:29,800 --> 00:48:37,480
So I think this would be incredibly helpful analytics to hemp producers because essentially

435
00:48:37,480 --> 00:48:43,940
if you could give them like a chart because what I think would be cool would be to plot

436
00:48:43,940 --> 00:48:51,320
a figure so like you were saying like to on the x-axis you would have number of days so

437
00:48:51,320 --> 00:49:06,640
you could have, like, number of days between harvest and sample date.

438
00:49:06,640 --> 00:49:14,120
So it would be interesting to see if the number of days in between there would affect the

439
00:49:14,120 --> 00:49:22,560
probability of failing for high THC.
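That harvest-to-test gap is straightforward to compute once both dates are parsed; the column names and dates below are made up for illustration:

```python
# Sketch of the proposed x-axis variable: days between harvest and sample date.
# Dates and column names are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "harvest_date": pd.to_datetime(["2020-09-15", "2020-10-01"]),
    "sample_date":  pd.to_datetime(["2020-09-30", "2020-11-05"]),
})

# Elapsed days from harvest to test, the candidate predictor of failing hot.
df["days_to_test"] = (df["sample_date"] - df["harvest_date"]).dt.days
print(df["days_to_test"].tolist())  # → [15, 35]
```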

440
00:49:22,560 --> 00:49:29,080
So that's the analysis we're going to start doing so like I said I'm just now starting

441
00:49:29,080 --> 00:49:36,480
so you're going to see me basically just start hacking at this live right now but then

442
00:49:36,480 --> 00:49:44,080
I think we can continue this next week and actually build a good model here

443
00:49:44,080 --> 00:49:53,280
and actually make some worthwhile predictions and produce some good statistics here.

444
00:49:53,280 --> 00:50:05,440
So first, like I said, rule number one: look at the data. So let's

445
00:50:05,440 --> 00:50:17,680
see if we can't do just that.

446
00:50:17,680 --> 00:50:27,040
Okay so we've got 752 samples here.

447
00:50:27,040 --> 00:50:40,200
The average is 5% CBD you've got the average THC is 0.23 so about 0.24% which is just squeaking

448
00:50:40,200 --> 00:50:52,160
under the cutoff but as you see there's a standard deviation of 0.2 you know 0.2 so

449
00:50:52,160 --> 00:50:57,560
you're going to have a large amount failing and so instead of just saying a large amount

450
00:50:57,560 --> 00:50:59,840
let's actually calculate that real quick.

451
00:50:59,840 --> 00:51:20,400
So let's just say okay why don't we just calculate the length of the data where and I may rename

452
00:51:20,400 --> 00:51:25,280
these columns here in a second for ease of use.

453
00:51:25,280 --> 00:51:37,480
So we want to locate these where the THC is greater than or equal to 0.3% and it looks

454
00:51:37,480 --> 00:51:54,560
like we've got a column naming issue going on here.

455
00:51:54,560 --> 00:52:20,640
Okay so for ease of use let's just rename some of these columns real quick

456
00:52:20,640 --> 00:52:33,280
I'm just going to do it here just.

457
00:52:33,280 --> 00:52:49,280
Thanks for bearing with me.

458
00:52:49,280 --> 00:53:00,680
Now it's just going to be a bit easier to work with the data.

459
00:53:00,680 --> 00:53:04,800
Now we can do this a bit quicker.

460
00:53:04,800 --> 00:53:19,280
So let's find all of the data where it fails. We've got 184, and so what's the percentage?

461
00:53:19,280 --> 00:53:23,800
Well, that's just going to be out of the total data set.
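The renaming and failure-rate calculation just walked through look roughly like this; the headers and values are placeholders, and 0.3% is the federal total-THC limit mentioned earlier:

```python
# Sketch of renaming columns and computing the share of samples at or above
# the 0.3% total-THC limit. Headers and values are placeholders.
import pandas as pd

df = pd.DataFrame({
    "Total THC (%)": [0.21, 0.35, 0.28, 0.41],
    "Total CBD (%)": [4.8, 9.1, 6.0, 13.0],
})

# Shorter names are easier to work with.
df = df.rename(columns={"Total THC (%)": "total_thc",
                        "Total CBD (%)": "total_cbd"})

fails = df.loc[df["total_thc"] >= 0.3]
rate = len(fails) / len(df)
print(f"{len(fails)} failures, {rate:.0%} of samples")  # 2 failures, 50% of samples
```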

462
00:53:23,800 --> 00:53:33,080
So Charles this may give you a bit better of a failure rate for your models because

463
00:53:33,080 --> 00:53:46,800
here we've got you know almost 25% of the samples failing for high THC and that actually

464
00:53:46,800 --> 00:53:58,660
sounds about right from my historic experience, and so it's actually

465
00:53:58,660 --> 00:54:05,200
a major concern for hemp producers because historically I don't think every hemp producer

466
00:54:05,200 --> 00:54:13,120
was getting their product tested for THC but it's because you now have to get your lots

467
00:54:13,120 --> 00:54:23,760
tested and so you know cultivators are you know it's a major bummer right because you've

468
00:54:23,760 --> 00:54:30,960
gone through your whole harvest, you've done all of your work, you've harvested,

469
00:54:30,960 --> 00:54:36,480
you've dried, you've cured, and you send it in and you fail for high THC, and the rules

470
00:54:36,480 --> 00:54:43,440
say you now have to destroy your product so that could be a devastating hit to farmers

471
00:54:43,440 --> 00:54:50,960
to cultivators who aren't expecting that so that's why it's important to to look at this

472
00:54:50,960 --> 00:54:59,160
data so that way farmers can know ahead of time depending on what strain they're growing

473
00:54:59,160 --> 00:55:05,680
or when they're harvesting you know what what's their probability of failing because as we've

474
00:55:05,680 --> 00:55:12,880
noted, they're playing, this is a careful dance here

475
00:55:12,880 --> 00:55:20,920
where they have to wait as long as they can to kind of get their CBD levels up but then

476
00:55:20,920 --> 00:55:35,400
you know, you need to harvest before your THC becomes a problem. Okay, so now we essentially

477
00:55:35,400 --> 00:56:02,040
just want to code here whether it's a pass or a fail, so we'll just say fail

478
00:56:02,040 --> 00:56:14,440
so how can we assign zero one here you may know off the top of your head Charles you

479
00:56:14,440 --> 00:56:27,600
can use, is it the label encoder? Scikit-learn's LabelEncoder. Or you could just do

480
00:56:27,600 --> 00:56:41,520
a map, like a pandas map. Well, then, I think

481
00:56:41,520 --> 00:56:50,320
this may do the trick here but you know you're you're seeing essentially how about this but

482
00:56:50,320 --> 00:57:02,440
basically we just want this is a zero or one where it's a one for failure and so we want

483
00:57:02,440 --> 00:57:21,440
this to be a failure where if it's greater than or equal to 0.3 percent there we have

484
00:57:21,440 --> 00:57:30,480
it. Okay, so we're running near the end here, so essentially what I'm going to do next

485
00:57:30,480 --> 00:57:39,840
is essentially run a logistic regression of the failure rate because this is you know

486
00:57:39,840 --> 00:58:04,280
zero or one. Yes. And so this is where you can do some really interesting analysis,

487
00:58:04,280 --> 00:58:09,960
you know, with some binary models, and so we're going to use logistic regression.

488
00:58:09,960 --> 00:58:21,360
so basically like I said we may need to wait till next week so that way I can actually

489
00:58:21,360 --> 00:58:30,200
write the model and have it ready to go for us, but essentially I'll be using the

490
00:58:30,200 --> 00:58:42,280
logit, and so that's our dependent variable, and for our independent variables we'll

491
00:58:42,280 --> 00:58:49,640
want to think about how we can work with these and so right off the bat I was thinking okay

492
00:58:49,640 --> 00:58:58,120
we can basically use dummy variables depending on what state they're in so real quick in

493
00:58:58,120 --> 00:59:04,560
the last minute we'll talk about some of the factors that we'll use for next week's analysis

494
00:59:04,560 --> 00:59:24,520
but basically I'll be using state to start with, so we essentially have Illinois.

495
00:59:24,520 --> 00:59:28,360
We'll have to take into consideration that it looks like it's spelled in two different

496
00:59:28,360 --> 00:59:35,040
ways here or it's got a space so we'll want to make sure to strip out that space but we

497
00:59:35,040 --> 00:59:42,400
basically have Illinois Wisconsin Indiana and Michigan and so right off the bat I essentially

498
00:59:42,400 --> 00:59:51,000
want to you know calculate the failure rates in these states right so for example you know

499
00:59:51,000 --> 01:00:14,200
rates. So, for example, the data.

500
01:00:14,200 --> 01:00:22,200
so you can start to you know right off the bat just look at conditional averages so if

501
01:00:22,200 --> 01:00:28,400
you've been here before, you'll know that I'll say that you can find extraordinary

502
01:00:28,400 --> 01:00:35,560
insights by just taking conditional averages so we'll start doing formal regressions next

503
01:00:35,560 --> 01:00:42,280
week so next week I'll show you a logistic regression where we start to predict failure

504
01:00:42,280 --> 01:00:52,680
rates, but for now I'll let you start looking over this data. I've published this to

505
01:00:52,680 --> 01:01:09,880
GitHub, so you should be able to find this data set here on GitHub, in the cannabis data

506
01:01:09,880 --> 01:01:19,440
science repository so I'll let you start looking at this data yourself as good Bayesians until

507
01:01:19,440 --> 01:01:27,960
next week and then that way you can establish your priors and then we can do our analysis

508
01:01:27,960 --> 01:01:36,160
next week and like I said start looking at some conditional averages so for example just

509
01:01:36,160 --> 01:01:43,760
this simple one right you've got a four percent higher failure rate in Indiana than in Michigan
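Those conditional averages are one groupby away: since fail is coded 0/1, the group mean is the failure rate. The numbers below are invented for illustration, not the quoted four-point gap:

```python
# Sketch of conditional averages: mean failure rate by state.
# The values are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "state": ["IN", "IN", "IN", "MI", "MI", "MI", "MI"],
    "fail":  [1, 0, 1, 0, 1, 0, 0],
})

# With a 0/1 fail column, the group mean is the conditional failure rate.
rates = df.groupby("state")["fail"].mean()
print(rates)
```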

510
01:01:43,760 --> 01:01:50,760
interesting maybe that's insignificant maybe that's significant so next week we'll start

511
01:01:50,760 --> 01:02:02,960
looking at some of the factors such as state and potentially try to use harvest date and

512
01:02:02,960 --> 01:02:11,160
sample date we'll have to be creative on how we measure that so maybe we measure how many

513
01:02:11,160 --> 01:02:19,480
days into the year it took them to harvest so think everybody think about that and we've

514
01:02:19,480 --> 01:02:25,480
reached the end of the hour here so I'll let people get on with their day but are there

515
01:02:25,480 --> 01:02:41,840
any questions, comments, concerns here at the end? Did you learn something, David? Was it worthwhile?

516
01:02:41,840 --> 01:02:45,240
Yeah, absolutely. It's a shame I'm not in front of my computer, because I would like

517
01:02:45,240 --> 01:02:49,400
to try to duplicate some of the stuff. But yeah, absolutely, it's quite fascinating.

518
01:02:49,400 --> 01:02:54,520
I would like to find out, just on my own, grab some of that data and compare it

519
01:02:54,520 --> 01:03:00,240
with what's in the soil within each individual state or each individual area, or

520
01:03:00,240 --> 01:03:06,440
it's all good. I like how you think, because that's exactly how you get these brilliant

521
01:03:06,440 --> 01:03:11,880
insights you take one data set and you combine it with another and so you had a brilliant

522
01:03:11,880 --> 01:03:21,440
idea look at the soil so if you can get soil data based on Michigan based in Indiana Wisconsin

523
01:03:21,440 --> 01:03:28,560
Illinois perhaps there is a factor in the soil that's affecting failure rates in the

524
01:03:28,560 --> 01:03:33,680
different states so that's exactly how you make these brilliant insights is combining

525
01:03:33,680 --> 01:03:42,600
data sets. So I like how you think. Thank you. Awesome, guys and gal. Yeah, Heather

526
01:03:42,600 --> 01:03:50,560
and crew. So it was awesome meeting with you all and talking about cannabis data, and we've

527
01:03:50,560 --> 01:03:57,760
got a good amount to dive into next week so next week I'll have the logistic regression

528
01:03:57,760 --> 01:04:05,440
prepared Charles if you want to apply your prediction model to the hemp data I think

529
01:04:05,440 --> 01:04:13,800
we've got some good insights we can uncover. So thank you very much. Definitely.

530
01:04:13,800 --> 01:04:17,920
well it was awesome speaking with you all today and until next week have a productive

531
01:04:17,920 --> 01:04:23,720
week. Thanks a lot. Have an awesome day.

