1
00:00:00,000 --> 00:00:11,920
Welcome to Cannabis Data Science. Fabulous start to the year, crunching a lot of data,

2
00:00:11,920 --> 00:00:17,600
working on a lot of cool cannabis data related projects. Definitely want to get you all involved

3
00:00:17,600 --> 00:00:23,520
today. That's why I'm going to share with you one of the largest data sets of lab results

4
00:00:23,520 --> 00:00:32,800
out there. It's nicely curated, fresh for statistics and you to peruse. I'll give you a quick

5
00:00:32,800 --> 00:00:38,720
take and then give you a bunch of ideas for projects that I'm working on. It's a meetup after

6
00:00:38,720 --> 00:00:45,920
all, so just want to give you all a chance to share what's on your mind and what projects may

7
00:00:45,920 --> 00:00:52,800
be at hand for you. Candice, I'd love to get you involved in some of these projects coming up,

8
00:00:52,800 --> 00:00:59,840
especially testing. Testing is a big thing at hand. However, I'm curious, what are you interested in

9
00:00:59,840 --> 00:01:04,320
at the start of the year? Any cool projects on your mind? Any questions that you'd like to answer?

10
00:01:05,600 --> 00:01:16,800
Well, let's see. I do have my Massachusetts Curaleaf COAs and so I was looking at a

11
00:01:16,800 --> 00:01:29,200
Kaycha Labs COA and I'm using Cannlytics. There are some assumptions, like for Kaycha Labs, that it's

12
00:01:29,200 --> 00:01:34,000
going to be at an address that's different from Massachusetts. I'm just looking through that.

13
00:01:34,000 --> 00:01:48,160
And I'm also using spaCy NLP to get the entities. I have C-PILs, various pages also.

14
00:01:49,840 --> 00:01:55,280
So I'm wondering if I shouldn't just be looking for the number of pages, then grabbing the lab,

15
00:01:55,280 --> 00:02:01,920
then just grabbing the address and everything on the COA, not necessarily assuming that something's

16
00:02:01,920 --> 00:02:06,640
going to be a certain address. So I've just been kind of playing around with that. Thank you, Keegan.

17
00:02:07,920 --> 00:02:15,280
This is phenomenal. Sneha, what Candice is talking about is, and this is actually relevant today,

18
00:02:15,920 --> 00:02:22,400
samples for cannabis go through quality assurance testing. They get tested for cannabinoids.

19
00:02:22,400 --> 00:02:28,640
The certificate of analysis is issued. And in states like Washington State, the public is

20
00:02:28,640 --> 00:02:33,920
permitted access to these certificates. So that's why it would be phenomenal if you could go into a

21
00:02:33,920 --> 00:02:41,280
retailer and know right off the bat, what are the cannabinoid percentages in this product?

22
00:02:41,280 --> 00:02:51,920
Are there any contaminants? And any other cool details like who was the producer? Who tested it

23
00:02:51,920 --> 00:02:58,240
in case you wanted to follow up with the lab for some follow-up questions. So all of these

24
00:02:58,240 --> 00:03:06,720
are pertinent details, it's just that they're trapped in these PDFs. So we created this tool, CoADoc,

25
00:03:06,720 --> 00:03:13,920
to basically parse out the data to the best of our abilities using any tool and cleverness that we

26
00:03:13,920 --> 00:03:19,200
can think of. That's cool. I didn't even know that data was publicly available. Exactly. It

27
00:03:19,200 --> 00:03:28,560
depends on the state. The cannabis industry, well, the cannabis regulators have made an effort

28
00:03:29,440 --> 00:03:39,280
to emphasize transparency in the market because this is federally not permitted. So the states

29
00:03:39,280 --> 00:03:46,080
wanted to be transparent about what they're doing, how they're regulating it, get the information to

30
00:03:46,080 --> 00:03:54,240
potential consumers. Now I did put in a Freedom of Information Act request for both Massachusetts, where I

31
00:03:54,240 --> 00:04:02,400
reside, and Florida, where I reside a few months in the winter. And they have 10 business days to

32
00:04:02,400 --> 00:04:11,200
respond via writing. And I guess her Freedom of Information Act law has expired, but also too,

33
00:04:11,200 --> 00:04:21,600
my mail may need to get from Massachusetts, CCC, over to Florida, where I am now. But I'm hoping

34
00:04:21,600 --> 00:04:27,200
that what I'm asking for is I'm asking for the exact same data that the state of Washington

35
00:04:27,200 --> 00:04:40,480
provides, pesticides, COAs, some SOP type of knowledge that Keegan has just done amazing work

36
00:04:40,480 --> 00:04:46,000
with. So I'm hoping that Massachusetts and Florida will follow, and then other states.

37
00:04:46,960 --> 00:04:50,400
Well, I can help on that effort. And welcome to the group,

38
00:04:50,400 --> 00:04:55,920
and Elise, just a heads up, we are recording just to save for future sake in case we

39
00:04:55,920 --> 00:04:59,600
think of anything interesting. Gotcha. No worries. Thank you.

40
00:05:00,320 --> 00:05:08,400
Basically, we're talking about what publicly available cannabis data is there. Well, cannabis

41
00:05:08,400 --> 00:05:14,880
goes through quality control testing, ideally for the consumers and in medical states, the patients.

42
00:05:15,520 --> 00:05:21,040
So it seems logical that, you know, they should have access to it. And in certain states,

43
00:05:21,040 --> 00:05:26,320
they explicitly do. And so we're capitalizing on that. So I think there's Washington state,

44
00:05:26,320 --> 00:05:31,040
and I think there's a total of six, where you're definitely allowed to get the

45
00:05:31,040 --> 00:05:38,400
certificates of analysis. It's just we just need to start slow and methodical and it'll lead into

46
00:05:38,400 --> 00:05:46,240
one of the insights of the day. Why are we asking you this data? Well, to learn about

47
00:05:46,240 --> 00:05:53,920
cannabis, and hopefully show you how you can draw insights from similar data sets. The way I like

48
00:05:53,920 --> 00:06:02,640
to say it, this data is so rich, give any good data scientist a chance to take a look, and they'll

49
00:06:02,640 --> 00:06:09,200
probably walk away with a novel insight that no one else had thought about. So today, I'm going

50
00:06:09,200 --> 00:06:16,960
to do a demonstration of that. Share with you a really simple but interesting way that you can

51
00:06:16,960 --> 00:06:25,040
draw insights using just a small subset of the data. And you can now analyze the data and

52
00:06:25,040 --> 00:06:30,960
hopefully draw some insights of your own. It's a meetup after all. Sneha, would you want to say

53
00:06:30,960 --> 00:06:38,320
a word for yourself and what you maybe want to learn or accomplish in the coming year, especially

54
00:06:38,320 --> 00:06:45,760
cannabis data science related? Well, I'm currently in a master's program at CU Boulder for data

55
00:06:45,760 --> 00:06:54,880
science. Data science wasn't always my field of study. I graduated in physiology from CU Boulder,

56
00:06:55,440 --> 00:07:02,880
but yeah, this was honestly one of my tasks for a class, to attend a meetup and learn some more about

57
00:07:03,440 --> 00:07:11,120
the field. And I thought this was interesting because I live in Colorado and weed is like a

58
00:07:11,120 --> 00:07:19,840
big thing here. So I just wanted to learn a little bit more about cannabis's relation with data

59
00:07:19,840 --> 00:07:26,240
science because I didn't even really consider it. So I think this whole thing is pretty cool and

60
00:07:26,240 --> 00:07:32,080
I'm willing to learn some more. I don't have much experience, but yeah. Phenomenal. Welcome to the

61
00:07:32,080 --> 00:07:38,720
group, Sneha. You're in for a treat today. We'll definitely share with you, well, hopefully some

62
00:07:38,720 --> 00:07:45,360
cool insights, some cool ways to crunch data and of course the data itself. So threefold, you should

63
00:07:45,360 --> 00:07:50,160
walk away with at least some bit of value. Welcome to the group, Isaac. We're just kind of doing a

64
00:07:50,160 --> 00:07:55,360
quick round of introductions before getting into the data at hand. Before I let you introduce

65
00:07:55,360 --> 00:08:03,120
yourself to the group, Emily, please correct me if I'm mispronouncing your name, but please let me know

66
00:08:03,120 --> 00:08:08,000
what you'd hope to get out of the group or learn here in the coming year.

67
00:08:08,000 --> 00:08:16,160
Yeah, so I look to just, I guess, I don't know, I just stumbled upon the group and I am kind of new

68
00:08:16,160 --> 00:08:28,080
to data analysis. I'm in my last year of my BA program for BI, data analytics with

69
00:08:28,080 --> 00:08:34,560
business intelligence. So I just figured for good practice in a project, I mean, I like weed. I live

70
00:08:34,560 --> 00:08:43,520
in Washington, right? So I just figured this would be a good way to dive into data science. So that's

71
00:08:43,520 --> 00:08:51,280
why I'm here. Phenomenal. And in fact, that's one thing that I thought was fun about when we did

72
00:08:51,920 --> 00:08:59,280
a series of Saturday morning statistics and I'll be uploading those throughout the spring. So you'll

73
00:08:59,280 --> 00:09:06,640
get a treat if you want to start catching up on those because it is cannabis after all, it's fun.

74
00:09:07,200 --> 00:09:14,080
And so it can be a fun way to learn about data science and statistics and the scientific method.

75
00:09:15,920 --> 00:09:22,240
So it's cool to have you here and let us know if you have any particular questions

76
00:09:22,240 --> 00:09:31,920
or ideas that way we can pursue those further. Now, Isaac, we had a time mix up last week,

77
00:09:31,920 --> 00:09:38,640
but back on track this week, got some cool analyses. However, love to hear about some of perhaps

78
00:09:38,640 --> 00:09:44,800
the work that you may have done, if you may want to share. So happy to have you here, Isaac.

79
00:09:44,800 --> 00:09:51,840
Yeah, of course. And I see from the chat that you shared a data file with me. It looks like

80
00:09:51,840 --> 00:09:58,800
the Washington data, but compiled into a very friendly format for analysis. Is that the file?

81
00:09:59,600 --> 00:10:06,880
Exactly. And this was what we thought was the value added was simply, and remember,

82
00:10:06,880 --> 00:10:12,480
it took us a while to get here. We had to diagram out the data and think about how we were going to

83
00:10:12,480 --> 00:10:17,520
merge it. This isn't anything fancy and you could probably have gotten to the same results

84
00:10:17,520 --> 00:10:29,040
through SQL queries, but this was basically us chomping down around 43 gigabytes of the CCRS data

85
00:10:29,840 --> 00:10:41,360
to about, it's only about 20,000, to only about 20 kilobytes. So we went from about 43 gigabytes

86
00:10:41,360 --> 00:10:47,760
of raw data. Of course, we're not looking at sales yet, but we boiled this down to about 20 kilobytes

87
00:10:47,760 --> 00:10:55,760
of useful lab result data. So yes, you can go work with the raw data. However, if you just want a

88
00:10:55,760 --> 00:11:04,480
quick glance at the lab results, then this is an effective way to make the data accessible. Also,

89
00:11:04,480 --> 00:11:12,320
welcome to the group, Yasha. We're going to be doing a lot of analysis of, well, we're going to

90
00:11:12,320 --> 00:11:20,240
start with analysis of lab results and we're going to use it in a peculiar way, but in a way,

91
00:11:20,240 --> 00:11:29,440
I'll show you how useful lab testing is. But before I get into that, I know that Isaac and perhaps

92
00:11:29,440 --> 00:11:37,360
yourself are looking at the data. So before I get into my trivial exercise, Isaac, would you want

93
00:11:37,360 --> 00:11:43,120
to share? I mean, you shared with me a figure. I don't know if you're all prepared to present or

94
00:11:43,120 --> 00:11:44,800
anything, but you don't have to.

95
00:11:44,800 --> 00:11:52,800
I'm happy to. I mean, we were discussing about the microbiome, the EB measurements from Washington

96
00:11:52,800 --> 00:12:00,720
Labs, and I just put it into a graph and I think that'd be interesting for us to all take a look

97
00:12:00,720 --> 00:12:08,400
at it and see what the group thinks. I'll try to present my screen.

98
00:12:08,400 --> 00:12:13,920
I love how you took this one step further. Essentially, we were just looking at detections.

99
00:12:13,920 --> 00:12:20,720
So last week, we were just seeing all the different pesticides that were detected. We hadn't gotten to

100
00:12:20,720 --> 00:12:26,560
the point of adding the limits. So I love that you actually did this because now this shows us not

101
00:12:26,560 --> 00:12:34,800
only the microbes detected, but also roughly the percentage that falls above the failure limit.

102
00:12:34,800 --> 00:12:42,080
Yes, and I understand this graph might be a little bit difficult to see what it is about, so I'll

103
00:12:42,080 --> 00:12:51,440
just try to explain in my terms. What you see here are nine plots. Each figure is associated

104
00:12:51,440 --> 00:13:04,480
with a number, 2909, for example, or 10. Each is one lab. And the x-axis is the log of microbial

105
00:13:04,480 --> 00:13:17,520
detection. So 2 is 100, 4 is 10,000. And the y-axis, because this is a histogram,

106
00:13:17,520 --> 00:13:27,680
so y-axis is just the density of data. For example, on the top left, you see there are roughly

107
00:13:27,680 --> 00:13:40,560
20 plus samples that have a detection around 10 to the power of 3 and all the way to the power of

108
00:13:40,560 --> 00:13:49,520
4. There are around 20 samples in each bin. And the red vertical line...

109
00:13:50,160 --> 00:13:55,600
Quick question. Are we looking at total failures right now, or is this a specific analyte?

110
00:13:55,600 --> 00:14:04,640
It's a specific analyte. It's Enterobacteria, and it's the type of bacteria you'll find in

111
00:14:05,440 --> 00:14:14,320
people's gut. One kind of straightforward way to explain it is just a poop bacteria.

112
00:14:15,040 --> 00:14:24,800
And obviously, you don't want to see a lot of that in your flower. And here the results are all...

113
00:14:24,800 --> 00:14:31,600
Well, first of all, I have filtered out all non-detections. Otherwise, there'll just be a

114
00:14:31,600 --> 00:14:37,920
huge bar at, say, zero or whatever is the detection limit, and we won't be seeing anything

115
00:14:38,960 --> 00:14:47,200
to the right, because the detections are only a small fraction next to the clean samples. And here,

116
00:14:47,200 --> 00:14:56,960
the red vertical line is the regulatory limit. That means any samples to the right of it, say,

117
00:14:56,960 --> 00:15:08,320
for lab 11 on the top plot, there is one sample, two samples, three samples that failed the test.

118
00:15:08,320 --> 00:15:16,880
Because their result is 10 to the power of four or more, so they failed.
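A rough sketch of the reading just described, assuming a table with hypothetical columns `lab_id` and `cfu_per_gram` (neither name comes from the CCRS files) and a limit of 10 to the power of four; it drops non-detections, puts the counts on the log scale used for the x-axis, and reports the share of each lab's detections to the right of the limit:
```python
import numpy as np
import pandas as pd
def summarize_detections(results: pd.DataFrame, limit_cfu: float = 1e4) -> pd.DataFrame:
    """Per-lab number of detections and share of them above the limit."""
    # Filter out non-detections (zero or missing), as was done for the plots.
    detected = results[results["cfu_per_gram"] > 0].copy()
    # Log scale of the histogram x-axis: 2 is 100, 4 is 10,000.
    detected["log10_cfu"] = np.log10(detected["cfu_per_gram"])
    # Assumption: a sample fails when it is strictly above the limit.
    detected["failed"] = detected["cfu_per_gram"] > limit_cfu
    return detected.groupby("lab_id").agg(
        detections=("failed", "count"),
        share_failed=("failed", "mean"),
    )
# Made-up values for illustration only, not real lab data.
example = pd.DataFrame({
    "lab_id": ["11", "11", "11", "2914", "2914", "2914"],
    "cfu_per_gram": [0, 500, 20000, 9000, 9500, 0],
})
summary = summarize_detections(example)
print(summary)
```
Comparing `share_failed` across labs is one simple way to put a number on what the nine plots show by eye.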

119
00:15:16,880 --> 00:15:21,920
Two thoughts come to mind. I should have suggested this last week. One of the first analyses we did

120
00:15:21,920 --> 00:15:30,560
was just look at the residual solvents. We compared butane in Washington versus California.

121
00:15:31,120 --> 00:15:36,400
And we were seeing that there were concentrates that were making it to the shelves in Washington

122
00:15:36,400 --> 00:15:43,360
that wouldn't have passed California's quality control standards. So I wonder if something similar

123
00:15:43,360 --> 00:15:51,600
may be... I wonder what... So for example, I wonder what your microbe detection limit is in Massachusetts.

124
00:15:51,600 --> 00:15:57,760
So for example, I wonder if some of these samples either would or would not make it to the shelves

125
00:15:57,760 --> 00:16:06,320
in Massachusetts. That's a great question. And the results are all very similar.

126
00:16:06,320 --> 00:16:15,200
Limits actually differ, in some cases significantly. And even the entire set of tests. The state of Washington

127
00:16:15,200 --> 00:16:27,280
only requires testing Enterobacteria, while Massachusetts requires four. We require AC,

128
00:16:27,280 --> 00:16:36,160
the bacteria that breathe oxygen, and CC, coliform, and the most important, yeast and mold.

129
00:16:36,160 --> 00:16:44,160
If a sample is moldy, it won't pass. And also EB, which is this gut bacteria.

130
00:16:44,160 --> 00:16:53,200
And Washington state only requires this one type of bacteria to be tested.

131
00:16:53,200 --> 00:17:02,080
So that's a very big difference to start with. And also in terms of limits, it's also different.

132
00:17:02,080 --> 00:17:11,920
I believe that EB... Well, actually, I can't remember the exact number. I think EB is 10 to the power of

133
00:17:11,920 --> 00:17:18,320
three versus 10 to the power of four here. And for other screens,

134
00:17:18,320 --> 00:17:30,160
the residual solvents, it's more obvious. For example, the butane limit for Massachusetts

135
00:17:30,160 --> 00:17:40,960
is at 12 ppm, which is very, very low versus 5,000 in most states. But that's rather a

136
00:17:40,960 --> 00:17:46,160
difference in regulators' approach in making their laws.

137
00:17:46,160 --> 00:17:51,200
Two more questions, and then I'll get to my second thought. The first was, what exactly are the units

138
00:17:51,200 --> 00:17:58,320
here, the test value? So we've got the limit at four. So at four colony-forming units, you fail.

139
00:17:59,120 --> 00:18:07,200
Okay. So that was the other thing that kind of jumped out is, I guess I'm curious about the

140
00:18:07,200 --> 00:18:13,360
number of tests happening in each lab. So for example, lab 11, it looks like it's a smaller lab.

141
00:18:13,360 --> 00:18:22,720
So they may not be the most comparable example. But for example, it does kind of look like a lot...

142
00:18:23,440 --> 00:18:29,520
For example, let me not throw them under the bus or anything, but lab 2912,

143
00:18:29,520 --> 00:18:36,320
coincidentally, a large percentage are falling under the four colony-forming units.

144
00:18:36,320 --> 00:18:42,080
And so I think it'd be interesting to, I guess, compare the different labs to see

145
00:18:43,120 --> 00:18:49,120
what's sort of the mean and variance at each lab, because this is where we were kind of

146
00:18:49,680 --> 00:18:55,920
talking about it wouldn't hurt to have a standardized method, because if lab 2912 is

147
00:18:55,920 --> 00:19:06,160
maybe not incubating as long as lab 2914, that may have a structural effect on their results.

148
00:19:06,160 --> 00:19:12,080
Are you kind of thinking something similar? Yeah. I mean, it's definitely one approach,

149
00:19:12,080 --> 00:19:20,800
but for me, what's striking in these nine different plots is the change of behavior

150
00:19:20,800 --> 00:19:28,720
around the regulatory limit. I mean, for a normal bacterial growth, you would expect it to be a

151
00:19:29,280 --> 00:19:37,760
natural phenomenon and it's going to be a nice curve. Rather, what we're seeing here are

152
00:19:38,480 --> 00:19:47,200
kind of almost two populations or even a cutoff around four.

153
00:19:47,200 --> 00:19:59,200
Well, for example, lab 2914, you can see they have a lot of detections

154
00:19:59,920 --> 00:20:07,680
just right below four, but above four, it drops by a significant amount. And you can just see

155
00:20:07,680 --> 00:20:19,040
on an intuitive level that it's not a result of a typical natural phenomenon. And I think that's

156
00:20:20,320 --> 00:20:28,640
very important evidence for us to say that there is potentially fraud happening.

157
00:20:28,640 --> 00:20:35,360
I can see that. And just to keep talking about different distributions, it also would look to me

158
00:20:35,360 --> 00:20:41,920
like, you know, perhaps lab 10 has some sort of truncated distribution. It looks like they are

159
00:20:41,920 --> 00:20:49,040
just starting their count at like three for some reason. And then lab, you know, 2908,

160
00:20:49,040 --> 00:20:59,440
their distribution looks just entirely skewed to the left. Yes. This is different because

161
00:20:59,440 --> 00:21:06,320
labs use different methods and they have different limits of detection on the lower end, which is

162
00:21:07,280 --> 00:21:16,160
what I think makes a kind of comparison of mean rather difficult because in this type of

163
00:21:16,160 --> 00:21:23,440
distribution, the lower end will skew the number a lot. And if they have a different cutoff,

164
00:21:23,440 --> 00:21:31,120
that will change their mean. And so it might not be representative. I think below quantifiable

165
00:21:32,320 --> 00:21:41,040
BQL wouldn't be there. It would probably be zero. If it was to be put in a number.

166
00:21:42,640 --> 00:21:52,560
Yes. Yeah. I mean, well, when you do an analysis, when you try to find a molecule

167
00:21:52,560 --> 00:21:59,920
that's present in a very small amount in something, there are usually, from a chemistry perspective,

168
00:21:59,920 --> 00:22:08,000
two thresholds. One is limit of detection and one is limit of quantification. So if there is

169
00:22:08,000 --> 00:22:15,120
enough of a response of that molecule that we're detecting, okay, that's above limit of detection.

170
00:22:15,120 --> 00:22:25,120
So we know that there are some of the molecules in the sample, but there also is a gap that's

171
00:22:26,160 --> 00:22:32,480
between the limit of detection and limit of quantification. If the response is, although

172
00:22:32,480 --> 00:22:43,680
it's there, but it's not of enough magnitude, we won't be able to have a conclusive count of

173
00:22:43,680 --> 00:22:49,680
the thing that they're trying to detect. So for example, limit of detection might be three,

174
00:22:49,680 --> 00:22:59,920
limit of quantification might be 10. So anything less than 10 above three will have a result of,

175
00:22:59,920 --> 00:23:05,280
okay, it's above detection limit. We know that there is some amount of it in the sample,

176
00:23:05,280 --> 00:23:13,440
but we don't know how much. Yasha was talking about this gap. I love that you compiled this data and

177
00:23:13,440 --> 00:23:19,680
analyzed it like this because this is where, remember there's two sides of the market, right?

178
00:23:19,680 --> 00:23:27,280
There's the supply side and the consumer side. And the suppliers are often really concerned about,

179
00:23:27,280 --> 00:23:33,120
so for example, a lot of the talk in the legislation is about batch size, like how big should the batch

180
00:23:33,120 --> 00:23:41,040
be? But this is a good perspective from the consumer side in that, wait, before you start

181
00:23:41,040 --> 00:23:47,840
working on all of these other things, maybe go back and iron out some of the other analyses.

182
00:23:53,120 --> 00:24:00,960
Long story short, I don't think anybody's even talking about the labs testing microbes

183
00:24:00,960 --> 00:24:07,200
differently. So if you just mention this, that hey, it doesn't look like there's a

184
00:24:07,840 --> 00:24:17,520
uniform way that labs are measuring, or at least their outcomes are, we think they look different.

185
00:24:18,480 --> 00:24:24,640
Could you help explain this? Or maybe the lab should focus on that. So I think this is brilliant,

186
00:24:24,640 --> 00:24:30,960
brilliant analysis. If you want, I could kind of, if you're okay with it, I may change gears and just

187
00:24:30,960 --> 00:24:39,680
sort of, I guess, extend your analysis by continuing looking at these lab results and try to draw just

188
00:24:39,680 --> 00:24:48,720
a completely wild, different insight in a whole different realm in genetics. And this is what's

189
00:24:48,720 --> 00:24:54,640
fun about this, right? So it's the same data set. Isaac and I are working on the same lab results out

190
00:24:54,640 --> 00:25:02,160
of Washington state. And here Isaac's uncovered a structural difference between how the labs are

191
00:25:02,160 --> 00:25:11,600
testing microbes, which of course has implications. And now I'll just sort of do a fun little

192
00:25:12,960 --> 00:25:17,920
demonstration of another way you can look at the data. So I'm going to go ahead and take over the

193
00:25:17,920 --> 00:25:24,880
screen, Isaac. So just to give you a quick background, always just trying to pin my analysis

194
00:25:24,880 --> 00:25:35,920
somewhere in science, and been really interested in genetics. So just here was a, I've been always

195
00:25:35,920 --> 00:25:44,000
trying to replicate cool figures. And I've been wanting to do a timeline of strains for the longest

196
00:25:44,000 --> 00:25:51,840
time. Now, when did various strains come into existence? And we finally compiled enough data

197
00:25:51,840 --> 00:25:57,920
that we can do just that. We're not going to walk away with as cool of a figure as this,

198
00:25:57,920 --> 00:26:06,000
but it will be in the spirit of this morphology tree. Without further ado, we've got a bunch of

199
00:26:06,000 --> 00:26:15,600
lab results here. We've got just shy of 53,000. We've got 52,809. Just to start showing you,

200
00:26:15,600 --> 00:26:22,880
in fact, another cannabis data science member taught me this, just start looking at the data.

201
00:26:26,240 --> 00:26:32,400
Let's look at this one first. Just start looking at the data, counting it. And that's a really

202
00:26:32,400 --> 00:26:40,000
good first step of understanding what's happening. If we just look at all the lab tests that were

203
00:26:40,000 --> 00:26:47,520
created, it looks like some of them were dated prior to 2022. And we've got them going through,

204
00:26:48,240 --> 00:26:59,760
I think we can find the last lab result. So we know the last lab result occurred on December 12.

205
00:26:59,760 --> 00:27:10,320
So we have data going through 2022, December 2. And we have lab results that are dated to

206
00:27:10,320 --> 00:27:18,160
the beginning of 2018. So this makes me think that perhaps people were entering in old lab results

207
00:27:18,160 --> 00:27:27,040
when the CCRS was enacted, which was late November, early December of 2021. This is always

208
00:27:27,040 --> 00:27:34,400
a tricky part with data science, figuring out what's your actual timeline of analysis.

209
00:27:34,400 --> 00:27:40,240
One second, let's see if we can't make this figure a little bit bigger. We'll have to restrict to a

210
00:27:40,240 --> 00:27:49,200
timeline. I figured, okay, let's look at 2022. Well, as you can see, it's a little bit anomalous.

211
00:27:49,200 --> 00:27:56,720
And so I'll explain to you what I think is going on. And this is why being a data scientist

212
00:27:56,720 --> 00:28:03,520
involves pulling from many different disciplines. And one of those disciplines that I love to pull

213
00:28:03,520 --> 00:28:11,600
from is being a historian. So really, if you're a good data scientist, you should go back and

214
00:28:11,600 --> 00:28:18,400
try to find news bulletins that the Washington State Liquor and Cannabis Board issued. Because

215
00:28:19,440 --> 00:28:25,120
I was following them at the time, but I'll need to dig them back up. I'm fairly certain that people

216
00:28:25,120 --> 00:28:34,240
had a window for when they could start entering data into the system. So they may have said, okay,

217
00:28:34,240 --> 00:28:41,760
you have until the end of March to have all your data entered into the CCRS. So as you can see,

218
00:28:42,320 --> 00:28:50,960
between the beginning of 2022 and April, you have a lot of data entry. So I don't know if this is

219
00:28:50,960 --> 00:28:57,760
representative of the number of lab results that are happening on a day-to-day basis. This may have

220
00:28:57,760 --> 00:29:03,360
just been people entering in a lot of historic lab results. So we may need to take that into

221
00:29:03,360 --> 00:29:12,240
consideration. But it looks like, okay, you know, it starts to stabilize around April. And then this

222
00:29:12,240 --> 00:29:22,400
may be your typical daily number of lab tests. So this is, I just love simple statistics. So

223
00:29:23,600 --> 00:29:31,360
count is a statistic. And so this is just a count of lab results by day in Washington State.
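A small sketch of that count, assuming the lab results sit in a table with a hypothetical `created_date` column (the real CCRS field name may differ); the analysis window is restricted to 2022, and the same helper regroups the count at a coarser frequency such as weekly:
```python
import pandas as pd
def count_results(results: pd.DataFrame, freq: str = "D",
                  start: str = "2022-01-01", end: str = "2022-12-31") -> pd.Series:
    """Count lab results per period ("D" daily, "W" weekly) inside a date window."""
    dates = pd.DatetimeIndex(pd.to_datetime(results["created_date"]))
    # One row per lab result, indexed by date, so a sum is a count.
    ones = pd.Series(1, index=dates).sort_index()
    # Restrict the timeline before aggregating; older entries may be backfilled.
    ones = ones.loc[start:end]
    return ones.resample(freq).sum()
# Made-up dates for illustration only.
example = pd.DataFrame({"created_date": [
    "2021-12-30", "2022-01-03", "2022-01-03", "2022-01-05", "2022-01-10",
]})
daily = count_results(example, freq="D")
weekly = count_results(example, freq="W")
print(daily)
print(weekly)
```
With pandas' default weekly rule the bins end on Sunday, so the choice of window and frequency both shape what looks like a typical level.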

224
00:29:31,360 --> 00:29:37,520
And you see, okay, you know, about 100 samples are getting tested every day in the state. As you can

225
00:29:37,520 --> 00:29:43,920
see, there's a little bit of a time effect, maybe a little bit of a lull during the summer. And looks

226
00:29:43,920 --> 00:29:50,960
like things may be picking up in the winter. And just to kind of show you some cool things, since

227
00:29:51,680 --> 00:29:59,840
it's a meetup after all, I realized what you can do is you can group these by month without too much

228
00:29:59,840 --> 00:30:06,960
effort here. So I think we may even just, yes, so that way you could find out the number of lab

229
00:30:06,960 --> 00:30:16,000
results that are happening per month, or I think you can even do per week, which is a fine frequency

230
00:30:16,000 --> 00:30:23,120
for predictions. Weekly is my favorite for forecasting. So this way you can see how many

231
00:30:23,120 --> 00:30:31,920
lab results are happening on a weekly basis. So around 500 lab results a week. Cool. Well,

232
00:30:31,920 --> 00:30:40,640
as always, I like to go micro. So we started aggregate, we started just looking at how many

233
00:30:40,640 --> 00:30:48,480
lab results were happening. Well, now I'd like to go micro. And it's actually kind of funny that

234
00:30:48,480 --> 00:30:56,400
Isaac was just talking about microbes. And so we'll zoom in now on a particular strain,

235
00:30:56,400 --> 00:31:02,560
keeping in mind that this can generalize to a bunch of different strains. And I just kind of want to

236
00:31:02,560 --> 00:31:09,680
see if we can draw some particular insights here. So for example, I keep talking about Runtz. So

237
00:31:09,680 --> 00:31:16,880
apparently, rumor has it that this was a strain that originated somewhere out of the San

238
00:31:16,880 --> 00:31:27,040
Francisco Bay Area. And so I'm curious, I was curious to start doing sort of genetic lineage

239
00:31:27,040 --> 00:31:34,400
tracing of strains and seeing how far back we can go. So we still have to do some of the really

240
00:31:34,400 --> 00:31:39,760
ancient stuff. And so going back to some of the early hazes, so I've got some cool history to

241
00:31:39,760 --> 00:31:44,800
share with you there. But I figured, okay, let's start with the present day of what we have just

242
00:31:44,800 --> 00:31:54,080
picking Runtz for no particular reason, you can look at all the different varieties of Runtz. So

243
00:31:54,080 --> 00:32:05,200
here, I just got a list of every different type of Runtz that's been grown in Washington State. So

244
00:32:05,200 --> 00:32:11,200
of course, you have just regular Runtz. Now you've got White Runtz, Knockout Runtz, Pink Runtz,

245
00:32:11,200 --> 00:32:22,400
Runtz F4, Runtz and Cream, your Red Runtz, Ripper Runtz, Gelato Runtz. So this is really cool,

246
00:32:22,400 --> 00:32:28,240
right? So I would think like you could start to do lineage tracing this way. And the way I would

247
00:32:28,240 --> 00:32:36,000
do it is okay. And this is Yasha where I was saying, this is a really peculiar, interesting

248
00:32:36,000 --> 00:32:45,280
value added to lab results in that, how do we know when a strain came about? Well, it could be when

249
00:32:45,280 --> 00:32:53,200
it's sold, but what if, you know, Banana Runtz or Ripper Runtz never sells? Well, I was thinking,

250
00:32:53,760 --> 00:33:02,800
you know, the first documentation that we have that Banana Runtz exists is the lab results.
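One way this first-documentation idea could be sketched, assuming hypothetical `strain_name` and `created_date` columns and a simple title-case cleanup of names (real strain names are messier than this); it finds every variety with the keyword in its name and the date of its earliest lab result:
```python
import pandas as pd
def first_known_tests(results: pd.DataFrame, keyword: str = "runtz") -> pd.Series:
    """Earliest lab result date for each strain name containing the keyword."""
    # Light normalization so "gelato runtz " and "Gelato Runtz" count as one variety.
    names = results["strain_name"].fillna("").str.strip().str.title()
    dates = pd.to_datetime(results["created_date"])
    matches = names.str.contains(keyword, case=False, regex=False)
    # The first lab result stands in for the first known occurrence.
    first = dates[matches].groupby(names[matches]).min()
    return first.sort_values()
# Made-up rows for illustration only.
example = pd.DataFrame({
    "strain_name": ["Gelato Runtz", "gelato runtz ", "Banana Runtz", "Blue Dream", None],
    "created_date": ["2022-03-01", "2021-12-15", "2022-06-10", "2020-01-01", "2022-01-01"],
})
first = first_known_tests(example)
print(first)
```
Sorting by date gives a crude timeline of varieties; backfilled historic results would shift these dates, so the same caveat about the timeline applies.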

251
00:33:02,800 --> 00:33:14,160
And so this may be the first known occurrence of Gelato Runtz. And so yes, we may not be able to pin

252
00:33:14,160 --> 00:33:21,120
down Runtz itself yet until maybe we start looking at some California data. And then we can basically

253
00:33:21,120 --> 00:33:29,120
find, you know, the first known test of Runtz in California. And then, you know, you could find the

254
00:33:29,120 --> 00:33:36,320
first known test of gelato in California. Those ones are probably pretty old. But then you could

255
00:33:36,320 --> 00:33:43,120
say, oh, well, you know, here's the first cross of Gelato Runtz. Of course, you know, other people

256
00:33:43,120 --> 00:33:49,760
in the country may have crossed this one. But, you know, I just think this is just an interesting

257
00:33:49,760 --> 00:33:57,120
place that we can begin. Enough of that. Let's just start looking at some figures here. So here's

258
00:33:57,120 --> 00:34:07,600
just the number of Runtz tests. So this is anything that has Runtz in its name. And as you can see,

259
00:34:08,080 --> 00:34:16,080
we may not necessarily want to use this whole time period. And this is where I was saying the timeline selection is of critical

260
00:34:16,080 --> 00:34:23,680
importance. So for example, if you were forecasting the popularity of Runtz, well, first off, I would

261
00:34:23,680 --> 00:34:30,480
use sales. So I think we should eventually tie these to sales to see, you know, what's the total

262
00:34:30,480 --> 00:34:40,720
dollar amount of Runtz being sold over time. So this is just number of tests. So it's a proxy for

263
00:34:40,720 --> 00:34:48,320
popularity. But, you know, if we were using this whole time period for prediction, we may forecast

264
00:34:48,320 --> 00:34:55,440
that, you know, Runtz is going to lose all of its popularity in 2023. But that may be because our

265
00:34:55,440 --> 00:35:02,240
data is skewed. There's measurement error. People were entering in old data. So if we were going to

266
00:35:02,240 --> 00:35:08,800
do forecasting, you know, we may actually be better off just picking, you know, the last six months

267
00:35:08,800 --> 00:35:16,160
or so. Just use the last six months of data and forecast the popularity of Runtz moving forward.
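
A minimal sketch of the windowing idea just described: keep only the last six months of test counts, then extrapolate one month ahead. The talk does not say which forecasting model it would use, so a straight-line least-squares fit is swapped in here as the simplest possible stand-in. The dates and counts are invented for illustration, and `recent_monthly_counts` and `linear_forecast` are hypothetical helper names, not functions from the speaker's repository.

```python
from collections import Counter
from datetime import date


def month_index(d):
    """Number months consecutively so that windows can cross year boundaries."""
    return d.year * 12 + (d.month - 1)


def recent_monthly_counts(test_dates, end, months=6):
    """Count tests per month for the `months` calendar months ending at `end`."""
    last = month_index(end)
    first = last - months + 1
    counts = Counter(month_index(d) for d in test_dates)
    return [counts.get(m, 0) for m in range(first, last + 1)]


def linear_forecast(counts):
    """Fit a least-squares line to the counts and extrapolate one step ahead."""
    n = len(counts)  # needs at least two points
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(counts) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, counts))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept + slope * n


# Invented data: a pile of stale, back-entered records plus six recent months.
test_dates = [date(2021, 3, 1)] * 40  # falls outside the window and is ignored
for offset, n_tests in enumerate([2, 3, 4, 5, 6, 7]):
    test_dates += [date(2022, 7 + offset, 15)] * n_tests

window = recent_monthly_counts(test_dates, end=date(2022, 12, 31))
print(window, linear_forecast(window))
```

The point of the window is that the old, possibly mis-entered records never reach the model at all, so they cannot drag the trend around.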

268
00:35:16,160 --> 00:35:24,240
Cool. Well, now here's what you came for. So this is essentially what we wanted to try to build. So

269
00:35:24,240 --> 00:35:32,400
this is the model. So this is, you know, how we'll model the data. You know, the code is on GitHub.

270
00:35:32,400 --> 00:35:38,000
And I found some people are interested in the code. So if you're interested in the code, go and

271
00:35:38,000 --> 00:35:46,640
parse through it. There's nothing fancy and it's open source. So you're welcome to pull from it and

272
00:35:46,640 --> 00:35:53,040
use it however you please. What's more interesting to me are the visualizations. First, I'll just

273
00:35:53,040 --> 00:36:00,000
explain the logic of what I've done. And then the visualization. So we've got the beginning date and

274
00:36:00,000 --> 00:36:07,360
end date. We know that we want to look at Runtz. Well, what we can do, we can get every lab test that

275
00:36:07,360 --> 00:36:17,280
contains the name Runtz, and we can find the first lab result, the first date, the created date, for each

276
00:36:17,280 --> 00:36:24,960
lab result for each of these varieties. I mean, I've called this the Genesis. So we can see,

277
00:36:24,960 --> 00:36:35,840
okay, the Soap in Runtz was first tested on May 25th, 2022. You know, Apple Fritter Runtz was

278
00:36:35,840 --> 00:36:46,240
first tested on December 13th, 2022. There's going to be too many of them to plot aesthetically.
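
The logic described here can be sketched in a few lines: filter to strain names containing a keyword, keep the earliest test date per name, then sample a handful for plotting. This is an illustration under stated assumptions, not the code from the speaker's repository. Only the Apple Fritter Runtz date comes from the talk; every other date is made up, and `first_tested` and `sample_timeline` are hypothetical names.

```python
import random
from datetime import date

# (strain_name, date_tested) pairs. Only the Apple Fritter Runtz date is
# taken from the talk; the rest are invented for illustration.
results = [
    ("Greasy Runtz", date(2022, 1, 10)),
    ("Greasy Runtz #2", date(2022, 1, 24)),
    ("Greasy Runtz", date(2022, 3, 2)),
    ("Apple Fritter Runtz", date(2022, 12, 13)),
    ("Blue Dream", date(2022, 2, 1)),
]


def first_tested(results, keyword):
    """Map each strain name containing `keyword` to its earliest test date."""
    genesis = {}
    for name, tested in results:
        if keyword.lower() not in name.lower():
            continue  # not part of this strain family
        if name not in genesis or tested < genesis[name]:
            genesis[name] = tested
    return genesis


def sample_timeline(genesis, k=15, seed=0):
    """Pick up to k varieties at random and return them in chronological order."""
    rng = random.Random(seed)
    names = rng.sample(sorted(genesis), min(k, len(genesis)))
    return sorted((genesis[name], name) for name in names)


genesis = first_tested(results, "runtz")
for tested, name in sample_timeline(genesis):
    print(tested.isoformat(), name)
```

Sampling before sorting keeps the plotted subset random while the plotted order stays chronological.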

279
00:36:46,240 --> 00:36:57,680
There are 178 varieties of Runtz. But I'm just going to plot a random 15 to give you an idea of what

280
00:36:57,680 --> 00:37:09,440
this looks like. So here we have it. So as I said, it's not beautiful. It's not your typical

281
00:37:09,440 --> 00:37:18,080
phylogenetic tree, but it's an effort. So here we have a chronological order of when

282
00:37:18,080 --> 00:37:25,600
various varieties of Runtz were first tested in Washington State. So we see some of the more

283
00:37:25,600 --> 00:37:34,560
recent varieties, I think the Pink Runtz, the Golden Runtz. And as time goes by, we see,

284
00:37:34,560 --> 00:37:40,320
and this is where I was talking about the potential importance of this. Look at this. What was the

285
00:37:40,320 --> 00:37:51,360
first variety of Runtz tested in 2022? It was the Greasy Runtz. And look at this, shortly after you

286
00:37:51,360 --> 00:37:59,680
have Greasy Runtz number two. And so I was thinking this, we were talking about patenting plant

287
00:37:59,680 --> 00:38:07,280
varieties. Well, one thing you could say is, well, I was the first one to have this plant variety tested.

288
00:38:07,280 --> 00:38:15,680
So we could actually find who was the first cultivator of Greasy Runtz. And it turns out

289
00:38:15,680 --> 00:38:23,280
it was Red Ridge Farms. And then look at this, not long after, you have a producer of

290
00:38:24,320 --> 00:38:33,120
Greasy Runtz number two, and it's a different cultivator, Sky Standard Gardens. So probably,

291
00:38:33,120 --> 00:38:40,080
well, who am I to conjecture? For all we know, there could be an intense rivalry between these

292
00:38:40,080 --> 00:38:49,920
cultivators. And one may be sore that the other one stole the name Greasy Runtz first.

293
00:38:50,480 --> 00:38:55,040
Because remember, these are getting entered into the Washington State traceability system.
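
As a rough sketch of the "who tested it first" lookup: keep the earliest record per strain name and read the producer off that record. The two farm names come from the talk, but the dates and the record layout (`strain`, `producer`, `tested`) are assumptions made for illustration, and `first_testers` is a hypothetical helper.

```python
from datetime import date

# Invented records; the field names are assumptions, not the real schema.
records = [
    {"strain": "Greasy Runtz", "producer": "Red Ridge Farms",
     "tested": date(2022, 1, 10)},
    {"strain": "Greasy Runtz #2", "producer": "Sky Standard Gardens",
     "tested": date(2022, 1, 24)},
    {"strain": "Greasy Runtz", "producer": "Sky Standard Gardens",
     "tested": date(2022, 2, 14)},
]


def first_testers(records):
    """Return {strain: (producer, date)} for the earliest test of each strain."""
    earliest = {}
    for rec in records:
        current = earliest.get(rec["strain"])
        if current is None or rec["tested"] < current["tested"]:
            earliest[rec["strain"]] = rec
    return {s: (r["producer"], r["tested"]) for s, r in earliest.items()}


for strain, (producer, tested) in sorted(first_testers(records).items()):
    print(f"{strain}: first tested by {producer} on {tested.isoformat()}")
```

Because the criterion is strictly the earliest date, a later test of the same name by another producer does not change the result.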

294
00:38:55,040 --> 00:39:00,880
I'm not certain if strain names have to be unique or if different people can use the same

295
00:39:00,880 --> 00:39:08,800
strain names. But remember, our criterion is all about first tested. So it actually wouldn't matter

296
00:39:08,800 --> 00:39:17,520
if Sky Standard Gardens did test the Greasy Runtz. It just matters who tested it first. So this was

297
00:39:17,520 --> 00:39:24,080
just a fun analysis that I thought you could do. And, you know, and I was just going to demonstrate

298
00:39:24,720 --> 00:39:31,200
you can have a lot of fun with this. So for example, you know, we talked about wedding

299
00:39:31,200 --> 00:39:39,840
cake. So you can find all the different varieties of wedding cake. And for example, I was a big

300
00:39:39,840 --> 00:39:48,400
Jack fan. So you can find all the different Jacks that people are producing: Tahoe Jack,

301
00:39:48,400 --> 00:39:55,360
Jack Herer, and Gelato. And then as I was saying, you know, we're trying to track down some of the

302
00:39:55,360 --> 00:40:03,200
hazes. So you can also find, say, different varieties of hazes that people were producing,

303
00:40:03,200 --> 00:40:10,080
and just start to get a timeline for these. So that was sort of my main analysis. As I was saying,

304
00:40:10,080 --> 00:40:16,880
it's kind of just light and fun. We know that strain names in and of themselves don't mean

305
00:40:18,080 --> 00:40:22,240
much; it's just a name at the end of the day, right? So as I was saying, right,

306
00:40:22,240 --> 00:40:30,160
two competing farms, maybe if you see your neighbor produce Greasy Runtz, you know, the next week,

307
00:40:30,160 --> 00:40:36,320
you've labeled something Greasy Runtz number two. When you go and get that tested, they may be

308
00:40:36,960 --> 00:40:42,240
chemically quite different, you know, Greasy Runtz and Greasy Runtz number two may be quite different.

309
00:40:42,240 --> 00:40:50,480
But if Red Ridge Farms makes a lot of clones of their Greasy Runtz, then all of those

310
00:40:50,480 --> 00:40:56,320
clones, there'll be a slight variation, right, there'll be the environmental variation. However,

311
00:40:56,320 --> 00:41:04,880
all those clones will have the same genetics, and will produce relatively chemically similar plants.

312
00:41:04,880 --> 00:41:11,600
I don't know, just something to think about, food for thought. But as I was saying,

313
00:41:11,600 --> 00:41:21,200
this all comes out of this data set here, where we just looked at one column. Well, here I

314
00:41:21,200 --> 00:41:28,720
used two columns, more or less. I used the date that these various lab results were created,

315
00:41:28,720 --> 00:41:36,640
and then I used the strain name. So you know, some strains are more popular than others. However,

316
00:41:36,640 --> 00:41:43,520
look at all of this rich data here, I haven't even really touched on any of these lab results.

317
00:41:43,520 --> 00:41:49,920
Remember, last week, we basically just looked at okay, what pesticides are we detecting? Well,

318
00:41:49,920 --> 00:41:58,000
Isaac took it one step further. And now Isaac's not only looking at microbes, but Isaac's also

319
00:41:58,000 --> 00:42:04,960
looking at whether the value is greater than or less than the Washington State limit. So

320
00:42:04,960 --> 00:42:11,200
Isaac has augmented this data with the Washington State limits, which is phenomenal,

321
00:42:11,200 --> 00:42:18,960
and has done a fruitful analysis. And so, as I said, I don't know how fruitful the strain

322
00:42:18,960 --> 00:42:24,960
analysis was, I think Isaac's analysis was super fruitful. So hopefully, this has gotten all of

323
00:42:24,960 --> 00:42:29,600
your minds thinking about some cool ways that you can use the data. So I'm going to stop

324
00:42:29,600 --> 00:42:35,040
presenting and see if any of you have any questions, thoughts or comments. And you're

325
00:42:35,040 --> 00:42:41,360
welcome to chime in. That was fascinating. Oh, I have a bunch of notes that I took on it.

326
00:42:42,560 --> 00:42:46,960
I still want to hog the microphone. Oh, please, please share any thoughts that come to mind.

327
00:42:46,960 --> 00:42:55,040
So the first one, you showed a graph of the timeline for Runtz. And it seemed that there were

328
00:42:55,040 --> 00:43:03,760
four strains that were tested within a couple of weeks, and then three months passed, and four

329
00:43:03,760 --> 00:43:13,040
more, which is a growth cycle away. As in, it's possible that the same folks grew four, saw the

330
00:43:13,040 --> 00:43:19,200
results and then decided to grow them again. But the strain names were slightly different between the

331
00:43:19,200 --> 00:43:24,080
first growth cycle and the second. And my curiosity is whether through the data we would be able to see

332
00:43:24,080 --> 00:43:34,560
what drove the change. Was the yield not what they wanted? Was the potency not what they expected?

333
00:43:34,560 --> 00:43:40,560
Or were there microbiological problems, which is why they wanted some sort of change in the genetics?

334
00:43:42,080 --> 00:43:44,640
Or was it none of those and they just wanted to try out other stuff?

335
00:43:46,160 --> 00:43:51,360
I love how you're thinking, Yasha. You're thinking really like a good microeconomist,

336
00:43:51,360 --> 00:43:57,760
because really you're going to have to dive deep into this. So, for example, you may want to start

337
00:43:57,760 --> 00:44:06,400
looking at specific licensees. So, for example, look at Red Ridge Farms. And exactly, we, the

338
00:44:06,400 --> 00:44:14,800
data is there, it's just going to take some heavy curation. So I think you can find the yield from,

339
00:44:14,800 --> 00:44:22,960
so you can basically calculate how much did Greasy Runtz yield? Well, actually, that may be a certain

340
00:44:22,960 --> 00:44:29,200
batch size. So depending on how many tests they've done, we may actually be able to estimate yield

341
00:44:29,200 --> 00:44:34,320
that way, but we can probably, depending on the size of the cultivation, but long story short,

342
00:44:34,320 --> 00:44:42,720
you could probably find yield. You could also find sales. So maybe certain strains are selling

343
00:44:42,720 --> 00:44:51,600
better than others. And then you also alluded to, this is going to be a difficult web to unweave,

344
00:44:52,080 --> 00:45:01,280
because the varieties are cropping up all over the place. So that's why I wasn't, this isn't quite

345
00:45:01,280 --> 00:45:05,840
a phylogenetic tree, because it's not really saying that they're related. That's just saying

346
00:45:05,840 --> 00:45:11,680
when they occurred. So it's like, who knows? Who knows if these are even coming from the

347
00:45:11,680 --> 00:45:19,040
same original Runtz stock clones, or maybe there's some Runtz seeds out there. And then as we know

348
00:45:19,040 --> 00:45:27,520
about seeds, each seed will have genetic variants. And then I need to learn more about, and that's

349
00:45:27,520 --> 00:45:34,400
what I'm trying to do. I'm trying to learn more about, this is Sinsemilla Tips, this old book on

350
00:45:34,400 --> 00:45:40,080
cultivation. So I'm trying to learn more about standard cultivation techniques, because exactly,

351
00:45:40,080 --> 00:45:48,720
I'm trying to find out what is the life cycle of these plants? How quickly can you cross pollinate

352
00:45:49,520 --> 00:45:56,080
and create a new variety? So these are all super interesting questions. So I love how you're

353
00:45:56,080 --> 00:46:00,560
thinking. That's the purpose of the analysis after all. And that's the point of the Meetup Group.

354
00:46:00,560 --> 00:46:05,440
This was sort of a quick, dirty analysis, right? If you're doing something for publication, of

355
00:46:05,440 --> 00:46:12,800
course, or for a business, of course, do it much more rigorously. But this was just to sort of get

356
00:46:12,800 --> 00:46:20,480
your brains thinking about all the cool possibilities that are just waiting to be explored. Well,

357
00:46:21,440 --> 00:46:29,040
we've kind of reached the end a little soon today. I know we normally go long. I may go ahead and

358
00:46:29,040 --> 00:46:35,760
wrap up a little early today, unless, like I said, there's still time for some more thoughts, comments,

359
00:46:35,760 --> 00:46:43,840
and questions. But I think we've covered a lot of ground. And there's a, I don't know who to credit,

360
00:46:43,840 --> 00:46:50,560
but there's a useful tip that when you're giving a presentation or a talk, that it doesn't hurt to

361
00:46:50,560 --> 00:46:56,720
let the audience go five minutes early, because people kind of appreciate you for that. So that'll

362
00:46:56,720 --> 00:47:04,000
be one insight for the day. And then the other insight was simply, you know, it's better to start

363
00:47:04,000 --> 00:47:11,280
now than never. And that was one thing that I was thinking about. Yes, we would love to be able to

364
00:47:11,280 --> 00:47:20,480
trace these strains back further, but we can at least start now. So now people in the future may

365
00:47:20,480 --> 00:47:30,240
thank us for starting to track strain origins in 2022. You know, now is better than never. So now

366
00:47:30,240 --> 00:47:38,000
we can, you know, start to piece out what exactly is descended from Greasy Runtz. And the data's

367
00:47:38,000 --> 00:47:44,800
there. I think if you want, you can dig into, you know, was this Greasy Runtz grown from clone,

368
00:47:44,800 --> 00:47:50,320
or was it grown from seed? I think the data may be there if you're ambitious and you want to

369
00:47:50,320 --> 00:47:57,760
dig enough. But I think you're all awesome data scientists. So I want to thank you all. Thank you

370
00:47:57,760 --> 00:48:03,200
all for coming, lending your eyes, your ears, your brilliant minds. We're advancing cannabis science,

371
00:48:03,200 --> 00:48:08,560
one molecule at a time. I don't know, I'm tickled with the progress that we've made. So thank you

372
00:48:08,560 --> 00:48:13,120
all. And hopefully we can keep the conversation going throughout the week and then rendezvous

373
00:48:13,120 --> 00:48:20,560
again next week and explore some more cannabis data.