1
00:00:00,000 --> 00:00:09,600
Welcome to Cannabis Data Science, the 73rd one.

2
00:00:09,600 --> 00:00:12,560
So we've been doing this for about a year and a half.

3
00:00:12,560 --> 00:00:17,360
So thank you all for lending your ears and making this happen.

4
00:00:18,800 --> 00:00:26,320
Long story short, I was working on an automation task because we actually found an awesome

5
00:00:26,320 --> 00:00:27,920
source for lab results.

6
00:00:27,920 --> 00:00:34,600
So we've been looking at lab results for a handful of samples and strains, and I figured,

7
00:00:34,600 --> 00:00:36,760
well, let's just go straight to the source.

8
00:00:36,760 --> 00:00:38,760
And so today we'll do that.

9
00:00:38,760 --> 00:00:41,400
Go straight to the source, get the lab results.

10
00:00:41,400 --> 00:00:46,280
I ended up spending a bunch of my time doing automation.

11
00:00:46,280 --> 00:00:50,760
So we'll finally have an awesome data set for you.

12
00:00:50,760 --> 00:00:57,680
But then, you know, you fine folk may have to help me in the coming weeks with some of

13
00:00:57,680 --> 00:00:58,680
the statistics.

14
00:01:00,960 --> 00:01:05,880
Essentially, the topics of the day are, you know, web scraping.

15
00:01:05,880 --> 00:01:11,880
And then just to introduce some fun data science topics, I thought, OK, we could also talk

16
00:01:11,880 --> 00:01:13,480
about margin of error.

17
00:01:13,480 --> 00:01:19,080
That's actually pertinent to the data we'll be collecting in a couple of sessions.

18
00:01:19,080 --> 00:01:25,400
And then we'll also look at secure hash algorithms just because let's just spice it up with

19
00:01:25,400 --> 00:01:30,080
something interesting and fun and just, you know, fun fact.

20
00:01:30,080 --> 00:01:33,680
These little things end up having so many uses.

21
00:01:33,680 --> 00:01:40,040
So it turns out, I was just reading, and it looks like they were formally created by the NSA

22
00:01:40,040 --> 00:01:41,640
in 2002.

23
00:01:41,640 --> 00:01:46,520
And then once you get these new technologies, it's just amazing what people will think to

24
00:01:46,520 --> 00:01:47,760
do with that.

25
00:01:47,760 --> 00:01:54,920
And so ultimately, you know, these hashing algorithms are what's underlying Bitcoin.

26
00:01:54,920 --> 00:02:00,760
And this is what the authentication system for the Cannlytics website uses.

27
00:02:00,760 --> 00:02:04,720
We use SHA-256 hashes.

28
00:02:04,720 --> 00:02:10,320
So this is just pretty standard technology that people use.

29
00:02:10,320 --> 00:02:14,160
But anyways, I'll just show you how you can use them today, too.

30
00:02:14,160 --> 00:02:18,120
So what are we even going to be doing today?

31
00:02:18,120 --> 00:02:23,320
There's a fantastic laboratory in Michigan, PSI Labs.

32
00:02:23,320 --> 00:02:34,320
And through December, the end of December of 2021, they've been diligently posting their

33
00:02:34,320 --> 00:02:37,480
laboratory results.

34
00:02:37,480 --> 00:02:45,600
And as you can see, you can get quite detailed information for the various samples that have

35
00:02:45,600 --> 00:02:46,600
been tested.

36
00:02:46,600 --> 00:02:49,720
So this is a Marshmallow OG.

37
00:02:49,720 --> 00:02:53,400
It's awesome that they've taken an image of this.

38
00:02:53,400 --> 00:02:59,440
And I'll actually introduce to you some fun side projects you could potentially do.

39
00:02:59,440 --> 00:03:02,480
But you've got some images.

40
00:03:02,480 --> 00:03:05,120
They've got for various samples.

41
00:03:05,120 --> 00:03:10,320
It depends on what people requested or what analyses were performed.

42
00:03:10,320 --> 00:03:18,560
But you may have pesticides.

43
00:03:18,560 --> 00:03:24,800
Looks like it was non-detect across the board.

44
00:03:24,800 --> 00:03:28,720
Microbials, metals.

45
00:03:28,720 --> 00:03:34,120
Lots of data.

46
00:03:34,120 --> 00:03:43,040
From what I could tell, PSI Labs does its state that they don't want robots on their

47
00:03:43,040 --> 00:03:45,960
website or this or that.

48
00:03:45,960 --> 00:03:52,120
I was thinking from an accessibility point of view, I made this joke the other day that

49
00:03:52,120 --> 00:03:55,000
I'm half man, half Python.

50
00:03:55,000 --> 00:04:01,000
So I think you could almost make an accessibility claim that if your preferred way to browse

51
00:04:01,000 --> 00:04:06,480
the web is through Python, then so be it.

52
00:04:06,480 --> 00:04:14,120
And so what's cool about being half Python is that a human may look at this and you can

53
00:04:14,120 --> 00:04:15,520
only do so much.

54
00:04:15,520 --> 00:04:18,240
You can only remember so many photos.

55
00:04:18,240 --> 00:04:25,000
You can only remember so many lab names or sample names.

56
00:04:25,000 --> 00:04:40,200
But with Python, we've got a stunningly good memory and we can work incredibly quickly.

57
00:04:40,200 --> 00:04:46,000
So let's dip into some of our extraordinary abilities here.

58
00:04:46,000 --> 00:04:53,320
Because basically the idea is there's hundreds of pages here and if you just wanted to go

59
00:04:53,320 --> 00:04:58,560
through them one by one, it would be hard to review all of these samples.

60
00:04:58,560 --> 00:05:04,880
And the idea is if you reviewed all these samples, you'd have a good idea about what's

61
00:05:04,880 --> 00:05:14,800
the distribution of cannabinoids in flowers, concentrates, you name it.

62
00:05:14,800 --> 00:05:32,840
Because every type of sample...

63
00:05:32,840 --> 00:05:38,160
So basically the idea is we've been using requests a lot.

64
00:05:38,160 --> 00:05:42,640
We just request a page and that's fine and dandy.

65
00:05:42,640 --> 00:05:48,160
But I guess I won't go...

66
00:05:48,160 --> 00:05:50,760
I'm still with you all, right?

67
00:05:50,760 --> 00:05:54,800
So for sure it is.

68
00:05:54,800 --> 00:06:05,800
If you basically look at everything it's sent to you, basically what ends up happening is

69
00:06:05,800 --> 00:06:17,240
this whole body is rendered with JavaScript after it arrives in your browser.

70
00:06:17,240 --> 00:06:21,760
So it's hard to just request this in your traditional manner.

71
00:06:21,760 --> 00:06:29,360
So basically what we're going to do is we're just going to use Chrome through Python.

72
00:06:29,360 --> 00:06:39,280
So that's why I said we're sort of half man or half woman, half Python here, because we're

73
00:06:39,280 --> 00:06:41,840
still going to use Chrome.

74
00:06:41,840 --> 00:06:54,520
We're just going to drive Chrome through our Python console here.
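
A minimal sketch of driving Chrome from Python, assuming the Selenium package and a local Chrome installation are available. The page body is rendered with JavaScript, so the sketch waits for the cards to appear before reading them. The base URL, the `page` query parameter, and the `.sample-card` selector are placeholders made up for illustration, not the real site's values.

```python
from urllib.parse import urlencode

# Hypothetical placeholders: swap in the real listing URL and card selector.
BASE_URL = "https://example.com/results"
CARD_SELECTOR = ".sample-card"


def page_url(page: int, base_url: str = BASE_URL) -> str:
    """Build the URL for one page of results (the query parameter name is assumed)."""
    return f"{base_url}?{urlencode({'page': page})}"


def get_card_texts(page: int, wait_seconds: float = 10.0) -> list[str]:
    """Open one results page in Chrome and return the visible text of each card."""
    # Imported here so the URL helper works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(page_url(page))
        # The body is rendered by JavaScript, so wait until the cards exist.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, CARD_SELECTOR))
        )
        cards = driver.find_elements(By.CSS_SELECTOR, CARD_SELECTOR)
        return [card.text for card in cards]
    finally:
        # Always close the browser, even if the page fails to load.
        driver.quit()
```

Calling `get_card_texts(4200)` would open Chrome, load that page, and return one string per card, which is handy as a sanity check against what you see in the browser.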

75
00:06:54,520 --> 00:07:04,840
And this is going to be an interesting script to go through today since it is web based.

76
00:07:04,840 --> 00:07:09,320
So hopefully we've got enough bandwidth to get through here.

77
00:07:09,320 --> 00:07:19,200
But the idea is if you actually start looking at what's going on programmatically, you'll

78
00:07:19,200 --> 00:07:30,480
see that it looks like the developer may have coded in 492 pages, but if you start looking

79
00:07:30,480 --> 00:07:41,200
at how the website behaves, you'll see that there's actually thousands of pages of lab

80
00:07:41,200 --> 00:07:44,480
results.

81
00:07:44,480 --> 00:07:52,240
So here's lab results going back to 2019.

82
00:07:52,240 --> 00:07:54,480
Just keeps going and going and going.

83
00:07:54,480 --> 00:08:01,040
And so I realized that there's almost 5,000 pages of lab results.

84
00:08:01,040 --> 00:08:08,400
So estimated 10 lab results per page.

85
00:08:08,400 --> 00:08:13,640
That's about, and that's why I said you're going to get your hands on about 50,000 lab

86
00:08:13,640 --> 00:08:23,480
results here, because we're going to get just shy of that, about 49,000 lab results.

87
00:08:23,480 --> 00:08:28,560
And I will deliver all these to you as promised.

88
00:08:28,560 --> 00:08:32,600
As you'll see, the data collection process is a little lengthy.

89
00:08:32,600 --> 00:08:35,400
So I do have a lot.

90
00:08:35,400 --> 00:08:38,200
And then I'll basically just start dripping these out to you.

91
00:08:38,200 --> 00:08:43,640
But everyone who signed up will get every lab result that is collected.

92
00:08:43,640 --> 00:08:46,680
So how do you go about doing this?

93
00:08:46,680 --> 00:08:49,840
Well, what's cool is, so check this out.

94
00:08:49,840 --> 00:08:57,720
We'll basically just open up a Chrome browser here.

95
00:08:57,720 --> 00:09:07,360
And we'll want to define what page we want to point it at.

96
00:09:07,360 --> 00:09:20,340
So we basically say, OK, Chrome, go to PSI Labs website, please, and go to page 4200.

97
00:09:20,340 --> 00:09:30,840
And so now Chrome will diligently perform this task for us, albeit it may be a little

98
00:09:30,840 --> 00:09:38,960
slow, a little slow out of the gate just because of all the tasks I'm asking my computer to

99
00:09:38,960 --> 00:09:49,360
do at the moment, such as streaming to all of you fine folks.

100
00:09:49,360 --> 00:09:53,400
So hopefully this will get there sooner rather than later.

101
00:09:53,400 --> 00:10:01,080
So basically, I went through one of these pages here.

102
00:10:01,080 --> 00:10:10,320
So if you look at the Marshmallow OG page and this page, and you just say, OK, what

103
00:10:10,320 --> 00:10:15,000
are all the unique data points that we can collect?

104
00:10:15,000 --> 00:10:18,480
Well, oh, so finally.

105
00:10:18,480 --> 00:10:24,720
So awesome, this page loaded page 4200.

106
00:10:24,720 --> 00:10:25,720
Fantastic.

107
00:10:25,720 --> 00:10:34,040
So that's why I was saying I think you could make a good excuse, not even an excuse, I

108
00:10:34,040 --> 00:10:42,640
think you could make a good case that why should a website say that you can't access

109
00:10:42,640 --> 00:10:46,700
it through, quote unquote, automation?

110
00:10:46,700 --> 00:10:59,040
Because it's basically just, what if that's just your preferred way to browse the web?

111
00:10:59,040 --> 00:11:01,840
That's how you want to access the web.

112
00:11:01,840 --> 00:11:09,160
Anywho, that's what we'll basically be doing.

113
00:11:09,160 --> 00:11:17,440
The other reason is why should we even collect this data?

114
00:11:17,440 --> 00:11:25,840
Well basically, I had this almost like a panic slash nightmare on Monday because I couldn't

115
00:11:25,840 --> 00:11:28,680
find this website.

116
00:11:28,680 --> 00:11:38,960
And then it started to make me realize that we could easily lose all this data to history.

117
00:11:38,960 --> 00:11:44,440
You know, the lab results just stopped at 2021.

118
00:11:44,440 --> 00:11:56,100
So maybe they updated their software or maybe their software developer is no longer developing

119
00:11:56,100 --> 00:11:57,520
on this application.

120
00:11:57,520 --> 00:12:08,520
So there's many possibilities and long story short is I hate to see such phenomenal data

121
00:12:08,520 --> 00:12:10,520
on this critical thing.

122
00:12:10,520 --> 00:12:15,400
We all think that cannabis can do a lot of positive good in this world.

123
00:12:15,400 --> 00:12:21,000
So if all this data just vanished, that would just be a disaster.

124
00:12:21,000 --> 00:12:28,920
So I almost think future generations may look back on us and be like, you didn't even write

125
00:12:28,920 --> 00:12:32,680
down all that data and you had all the tools.

126
00:12:32,680 --> 00:12:37,960
It was just sitting there right before you and you just didn't take the time to do it.

127
00:12:37,960 --> 00:12:45,360
So basically, I just sort of had this panic on Monday and realized that yes, PSI Labs

128
00:12:45,360 --> 00:12:51,680
has probably got this data nice and diligently stored in their database.

129
00:12:51,680 --> 00:12:58,560
But it wouldn't hurt to just go ahead and create another archive of this data because

130
00:12:58,560 --> 00:13:05,080
I think this is data that in my opinion people are going to be looking at for years to come.

131
00:13:05,080 --> 00:13:11,880
So long story short, we can do our part and try to collect it.

132
00:13:11,880 --> 00:13:19,360
So and also I also kind of just see this as meeting people to get all the lab results

133
00:13:19,360 --> 00:13:22,960
online and accessible.

134
00:13:22,960 --> 00:13:23,960
That's it.

135
00:13:23,960 --> 00:13:31,160
Yeah, an API would be nice but we've got to test cannabis here.

136
00:13:31,160 --> 00:13:32,520
We did our part.

137
00:13:32,520 --> 00:13:37,080
And so basically, I'm saying, oh, fine, we'll meet you where you are.

138
00:13:37,080 --> 00:13:38,240
You did your part.

139
00:13:38,240 --> 00:13:41,520
You did a phenomenal job.

140
00:13:41,520 --> 00:13:46,040
We can take this and run it home for you.

141
00:13:46,040 --> 00:13:54,640
So long story short, we can basically just look at their HTML and start finding the various

142
00:13:54,640 --> 00:13:56,720
elements that we need.

143
00:13:56,720 --> 00:14:08,080
So basically, we can say, oh, on this first page, we can get these 10 samples just as

144
00:14:08,080 --> 00:14:11,240
a sanity check.

145
00:14:11,240 --> 00:14:14,440
We can look at this first card.

146
00:14:14,440 --> 00:14:17,680
So these are still sort of just like HTML elements.

147
00:14:17,680 --> 00:14:20,440
But I think we can print it out.

148
00:14:20,440 --> 00:14:23,200
So basically, this is all the text on the card.

149
00:14:23,200 --> 00:14:29,600
So we're looking at presidential by sensei.

150
00:14:29,600 --> 00:14:33,720
Sure enough, that's the first card right here.

151
00:14:33,720 --> 00:14:34,720
Presidential by sensei.

152
00:14:34,720 --> 00:14:37,160
It's actually sensei seven.

153
00:14:37,160 --> 00:14:39,560
Sure enough, sensei seven.

154
00:14:39,560 --> 00:14:43,000
And so that's what's nice about the computer, right?

155
00:14:43,000 --> 00:14:47,200
Us humans, we're real imperfect.

156
00:14:47,200 --> 00:14:58,400
But the idea is, as a computer, I think we underestimate how good their memory is, especially

157
00:14:58,400 --> 00:15:01,960
with these large data sets.

158
00:15:01,960 --> 00:15:11,960
So for example, I figured it'd probably be worthwhile to go ahead and deal with images

159
00:15:11,960 --> 00:15:19,800
in general. With online images, these are PSI Labs images.

160
00:15:19,800 --> 00:15:21,520
So they belong to them.

161
00:15:21,520 --> 00:15:26,520
So you can't really just download them and copy them around willy nilly.

162
00:15:26,520 --> 00:15:35,720
But in general, a link on the website, links on the web can be readily copied and shared.

163
00:15:35,720 --> 00:15:43,400
So basically, the idea is, you can readily link to their image.

164
00:15:43,400 --> 00:15:48,880
So you can save the URL to their image.

165
00:15:48,880 --> 00:15:56,920
And then if you ever want to go look at this image, say you're doing all your research

166
00:15:56,920 --> 00:16:04,080
and there's something just kind of funny about this presidential sample, you can go check

167
00:16:04,080 --> 00:16:05,340
it out.

168
00:16:05,340 --> 00:16:14,400
You can go look at their image, but displaying it on your website, you have to be super careful.

169
00:16:14,400 --> 00:16:18,280
Typically, you'd want to give PSI Labs credit.

170
00:16:18,280 --> 00:16:23,360
But generally, that's how Google and Google Images work, right?

171
00:16:23,360 --> 00:16:29,560
They're just using links to images all over the internet.

172
00:16:29,560 --> 00:16:35,360
So generally, like I said, I try not to be as ruthless as some of these companies that

173
00:16:35,360 --> 00:16:36,440
you see.

174
00:16:36,440 --> 00:16:39,680
But in general, right?

175
00:16:39,680 --> 00:16:46,200
For example, in the presentation today, I think it's worthwhile showing you their image.

176
00:16:46,200 --> 00:16:50,160
So I'm not copying it or saving it.

177
00:16:50,160 --> 00:16:54,760
I'm just kind of showing it to you and talking about it.

178
00:16:54,760 --> 00:16:59,960
But funny enough, does anybody recognize what this is?

179
00:16:59,960 --> 00:17:00,960
Trichomes?

180
00:17:00,960 --> 00:17:01,960
One more time.

181
00:17:01,960 --> 00:17:02,960
Trichomes?

182
00:17:02,960 --> 00:17:03,960
Well, exactly.

183
00:17:03,960 --> 00:17:08,800
So all the little crystal-y things are trichomes.

184
00:17:08,800 --> 00:17:13,320
I was thinking that there's so many plant parts here, right?

185
00:17:13,320 --> 00:17:14,760
The botanists could pick out so many.

186
00:17:14,760 --> 00:17:19,080
But basically, this whole thing is a calyx.

187
00:17:19,080 --> 00:17:25,800
So this is basically like one cannabis flower.

188
00:17:25,800 --> 00:17:28,200
And that's actually the Cannlytics logo.

189
00:17:28,200 --> 00:17:31,360
It's just a calyx.

190
00:17:31,360 --> 00:17:41,040
And so basically, like a bud is just hundreds, if not thousands, of these beautiful little

191
00:17:41,040 --> 00:17:42,040
flowers.

192
00:17:42,040 --> 00:17:46,200
And yeah, you've got the calyx.

193
00:17:46,200 --> 00:17:51,920
I think the orange hairs are called stigma, perhaps.

194
00:17:51,920 --> 00:17:54,280
And then exactly the trichomes.

195
00:17:54,280 --> 00:17:57,360
It's just covered in trichomes.

196
00:17:57,360 --> 00:18:03,680
And when you start to look at pictures like this, it actually makes the brain kind of

197
00:18:03,680 --> 00:18:04,680
think.

198
00:18:04,680 --> 00:18:10,880
And that's actually where my question for our research for today came from is, I mean,

199
00:18:10,880 --> 00:18:17,360
look how covered in trichomes this cannabis flower is.

200
00:18:17,360 --> 00:18:26,640
And so it kind of makes one think that could a cannabis breeder eventually breed a cannabis

201
00:18:26,640 --> 00:18:33,840
flower to basically just be one giant trichome?

202
00:18:33,840 --> 00:18:38,080
How big can these trichomes get?

203
00:18:38,080 --> 00:18:46,040
So I was thinking from a biological point of view, it may actually be, we may be able

204
00:18:46,040 --> 00:18:48,640
to parse it out in the data.

205
00:18:48,640 --> 00:18:54,680
Are these trichomes essentially getting bigger over time?

206
00:18:54,680 --> 00:19:03,340
And so the way we could basically say is, is the average cannabinoid concentration increasing

207
00:19:03,340 --> 00:19:04,340
over time?
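
One way to start on that question is to average the measured concentration by year and see whether the averages climb. The sketch below assumes each observation has already been reduced to a `(year, concentration)` pair; the function name and the toy numbers are illustrative only, not real lab results.

```python
from collections import defaultdict
from statistics import mean


def yearly_means(observations):
    """Average concentration per year from (year, concentration) pairs."""
    by_year = defaultdict(list)
    for year, concentration in observations:
        by_year[year].append(concentration)
    # Sort by year so the trend reads left to right.
    return {year: mean(values) for year, values in sorted(by_year.items())}


# Toy values for illustration only, not real lab results.
toy = [(2019, 20.0), (2019, 22.0), (2020, 24.0), (2021, 23.0), (2021, 27.0)]
print(yearly_means(toy))
```

A rising average would only be a first hint; sample types and testing methods would need to be held constant before reading much into it.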

208
00:19:04,340 --> 00:19:09,680
And then I was just going to throw out two real cool, quick machine learning projects

209
00:19:09,680 --> 00:19:11,320
you could do.

210
00:19:11,320 --> 00:19:14,480
You've got all these images.

211
00:19:14,480 --> 00:19:21,440
Well it's maybe sort of extraordinary how a computer could process an image.

212
00:19:21,440 --> 00:19:26,120
And so it's sort of like if you look at a cannabis flower, you could say, oh, that's

213
00:19:26,120 --> 00:19:27,720
high quality cannabis.

214
00:19:27,720 --> 00:19:30,640
That's low quality cannabis.

215
00:19:30,640 --> 00:19:35,920
And even when you start, say, for example, I worked at a cannabis testing laboratory

216
00:19:35,920 --> 00:19:41,520
and you start to see so many samples that you could actually even start to get a good

217
00:19:41,520 --> 00:19:44,040
approximation in your head.

218
00:19:44,040 --> 00:19:52,280
Like, okay, what percentage THC would this flower test at?

219
00:19:52,280 --> 00:19:56,960
You're not going to know spot on, but you can kind of just get an idea from just kind

220
00:19:56,960 --> 00:20:04,400
of looking at how covered in trichomes the flower is.

221
00:20:04,400 --> 00:20:10,800
So what I was wondering is, would it be possible, and you've got a couple different images here,

222
00:20:10,800 --> 00:20:11,800
right?

223
00:20:11,800 --> 00:20:18,360
So you could maybe use the flower zoomed out because this would be a bit more realistic

224
00:20:18,360 --> 00:20:20,640
in your application.

225
00:20:20,640 --> 00:20:28,440
But could you train a machine learning algorithm to predict the cannabinoid concentration by

226
00:20:28,440 --> 00:20:32,640
just looking at the image?

227
00:20:32,640 --> 00:20:40,320
Because the idea is for all of these samples, right, we have the concentration and we have

228
00:20:40,320 --> 00:20:41,320
the image.

229
00:20:41,320 --> 00:20:47,760
So you could actually, and like I said, I've never done any image processing, but I was

230
00:20:47,760 --> 00:20:54,280
just thinking that this actually is an application that you may be able to pull off.

231
00:20:54,280 --> 00:20:58,800
Is there any way that you can predict the concentration from the images?

232
00:20:58,800 --> 00:21:04,040
Because remember, we've got almost 50,000 observations.

233
00:21:04,040 --> 00:21:12,180
So long story short, as a diligent data scientist, I figured it would be worthwhile to just go

234
00:21:12,180 --> 00:21:21,680
ahead and record the image URLs, whether or not I'll use them, I don't know.

235
00:21:21,680 --> 00:21:27,080
But that's the cool thing about data is I always say never throw away data.

236
00:21:27,080 --> 00:21:34,040
Save every piece of data you can because who knows when it will be useful, and to whom, in the

237
00:21:34,040 --> 00:21:35,040
future.

238
00:21:35,040 --> 00:21:39,760
So I may never use the images and maybe some clever data scientist will think of a use

239
00:21:39,760 --> 00:21:40,760
for them.

240
00:21:40,760 --> 00:21:46,720
So long story short is you can go ahead and collect all the sample details through this.

241
00:21:46,720 --> 00:21:54,200
It's a little bit more complicated, but the code is code, so if you're interested in

242
00:21:54,200 --> 00:21:57,400
that you're welcome to read through the script.

243
00:21:57,400 --> 00:22:04,680
The idea is you can start to get all of the cool data points that you need.

244
00:22:04,680 --> 00:22:11,160
And then I promised you hashing algorithms, so I'll just go ahead and introduce those

245
00:22:11,160 --> 00:22:12,440
to you real quick.

246
00:22:12,440 --> 00:22:21,560
So essentially the way the hash works is you basically have a secret and a message.

247
00:22:21,560 --> 00:22:28,040
And so those are commonly called your private key and your public key.

248
00:22:28,040 --> 00:22:36,560
So in this case we'll just say okay, the date that this was tested is a quote unquote secret.

249
00:22:36,560 --> 00:22:37,640
It doesn't actually matter.

250
00:22:37,640 --> 00:22:41,200
I'm just using this to create a random ID.

251
00:22:41,200 --> 00:22:44,160
But I just kind of wanted to explain the concept to you.

252
00:22:44,160 --> 00:22:50,920
You basically got this secret that only you know and then somebody else has another piece

253
00:22:50,920 --> 00:22:51,920
of information.

254
00:22:51,920 --> 00:23:03,120
So in this case sample name and then basically if you combine these together in a hash you'll

255
00:23:03,120 --> 00:23:06,400
always get the same hash.

256
00:23:06,400 --> 00:23:11,400
So it's just going to be a bit, and I'll show you here in a second, it's going to be a big

257
00:23:11,400 --> 00:23:19,000
random string that's unique and you can't go backwards.

258
00:23:19,000 --> 00:23:26,040
So the idea is once you have the hash you can't back out this information here.

259
00:23:26,040 --> 00:23:32,360
So it ends up being useful in many different scenarios.

260
00:23:32,360 --> 00:23:38,360
This may have been a little much of a tangent, but I don't know, I threw it in there.

261
00:23:38,360 --> 00:23:43,800
And then it's common to quote unquote salt the public key.

262
00:23:43,800 --> 00:23:50,120
So basically just have just another random piece, not random, but another piece of data,

263
00:23:50,120 --> 00:23:56,120
in this case the producer that you'd add to the public information.

264
00:23:56,120 --> 00:24:02,120
So that way somebody basically says, oh, I want presidential OG.

265
00:24:02,120 --> 00:24:08,440
I know it's coming from the producer, salt that, and then I'm the only one who knows

266
00:24:08,440 --> 00:24:10,320
what date it was tested.

267
00:24:10,320 --> 00:24:15,120
And then all of a sudden you can create the sample ID.

268
00:24:15,120 --> 00:24:18,960
And then that's just going to be this long random string.

269
00:24:18,960 --> 00:24:26,600
It's completely unique that no one else is going to accidentally create the same ID and

270
00:24:26,600 --> 00:24:29,920
you can't back out any of the information.
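
A minimal sketch of that idea using Python's standard library. Combining a secret with a message in a hash is what HMAC does, so the sketch uses HMAC with SHA-256. The function name `create_sample_id` and the example values are assumptions for illustration, not necessarily what the meetup script uses.

```python
import hashlib
import hmac


def create_sample_id(private_key: str, public_key: str, salt: str = "") -> str:
    """Create a repeatable ID from a secret, a message, and an optional salt."""
    secret = private_key.encode("utf-8")
    message = (public_key + salt).encode("utf-8")
    # HMAC-SHA256 yields a 64-character hexadecimal string.
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


# Illustrative values: date tested as the secret, sample name as the
# message, and the producer as the salt.
sample_id = create_sample_id("2021-01-01", "Example Sample", "Example Producer")
print(sample_id)
```

The same three inputs always give the same ID, while changing any one of them gives a different string, which is what makes it usable as a lookup key on your end.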

271
00:24:29,920 --> 00:24:37,440
So the idea is this random ID I can just readily share with a lot of people.

272
00:24:37,440 --> 00:24:43,960
It has a lot of information to me, but it has no information to anyone else.

273
00:24:43,960 --> 00:25:01,480
This has no information to anybody, but I can now, now if some producer wants a sample,

274
00:25:01,480 --> 00:25:06,080
I can potentially look this up on my end.

275
00:25:06,080 --> 00:25:16,520
So kind of a bit of a tangent, but I thought it would flow a bit smoother into the meetup

276
00:25:16,520 --> 00:25:17,520
for today.

277
00:25:17,520 --> 00:25:23,560
But if you're interested, feel free to talk with me more about that or what have you.

278
00:25:23,560 --> 00:25:29,440
But long story short is we can go ahead and finish collecting all the cool data points.

279
00:25:29,440 --> 00:25:39,280
So now we've got the analyses and I'm not worrying about cleaning the data much yet,

280
00:25:39,280 --> 00:25:45,800
other than just removing like HTML stuff.

281
00:25:45,800 --> 00:25:49,880
So I'm just getting the data as is and we'll clean it up later.

282
00:25:49,880 --> 00:26:00,280
So basically, and now we'll go ahead and get the link to the actual details.

283
00:26:00,280 --> 00:26:07,400
So we basically find all of this data here, except I skipped these compounds since we'll

284
00:26:07,400 --> 00:26:10,960
just get those from the sample specific page.

285
00:26:10,960 --> 00:26:12,360
So cool.

286
00:26:12,360 --> 00:26:21,800
So we basically, we've got its name, company, date tested, sample type, analyses, and image.
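
Those fields could be held in a small record like the one sketched below. The class name, field names, and example values are illustrative assumptions; the text is kept raw here, to be cleaned up later, and images are stored as links only.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class SampleObservation:
    """One card's worth of data; field names are illustrative."""

    name: str
    company: str
    date_tested: str  # kept as raw text, to be cleaned later
    sample_type: str
    analyses: list[str] = field(default_factory=list)
    image_urls: list[str] = field(default_factory=list)  # links only, not the files


# Placeholder values, not a real lab result.
obs = SampleObservation(
    name="Example Sample",
    company="Example Producer",
    date_tested="2021-01-01",
    sample_type="Flower",
    analyses=["cannabinoids", "terpenes"],
)
row = asdict(obs)  # plain dict, ready to append to a list of rows
```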

287
00:26:21,800 --> 00:26:24,920
So tons of information already.

288
00:26:24,920 --> 00:26:30,400
And then we can basically, we can get its COA.

289
00:26:30,400 --> 00:26:40,920
And once again, you can't just copy the PDF, but you could record the URL.

290
00:26:40,920 --> 00:26:50,680
That way you could, if need be, you could find the COA in the future.

291
00:26:50,680 --> 00:27:01,720
And here is where I want to really applaud PSI Labs and introduce the other topic that

292
00:27:01,720 --> 00:27:06,520
we were going to talk about today, the margin of error.

293
00:27:06,520 --> 00:27:17,720
And so PSI Labs has internally studied their methodology, right, so that you can see they're

294
00:27:17,720 --> 00:27:19,160
even listing it here.

295
00:27:19,160 --> 00:27:30,640
They're using GC-FID, which is actually non-traditional, I would say, for cannabinoids.

296
00:27:30,640 --> 00:27:38,320
A lot of, or not even non-traditional, but it's a, remember this is 2016.

297
00:27:38,320 --> 00:27:47,880
So what would be actually interesting is to see, okay, in 2021, maybe they're using, the

298
00:27:47,880 --> 00:27:51,680
common method used today is HPLC.

299
00:27:51,680 --> 00:27:59,320
Back in the day, right, back in 2016, I think it was really common to do cannabinoid analysis

300
00:27:59,320 --> 00:28:01,560
by GC.

301
00:28:01,560 --> 00:28:10,280
This is gas chromatography versus liquid chromatography.

302
00:28:10,280 --> 00:28:18,200
And so if you're a diligent data scientist, as we are here, so we're going to record the

303
00:28:18,200 --> 00:28:23,760
method that was used, that could be applied statistically, right?

304
00:28:23,760 --> 00:28:32,720
So we may want to see if the method that you use to test cannabis has any statistical effect

305
00:28:32,720 --> 00:28:35,580
on the concentration, which it may.

306
00:28:35,580 --> 00:28:43,080
So we may find out that the HPLC is the superior method, and we may only want to, or the GC

307
00:28:43,080 --> 00:28:44,380
may be the superior method.

308
00:28:44,380 --> 00:28:46,160
We don't know until we test it, right?

309
00:28:46,160 --> 00:28:50,920
But I've got my hypothesis that the HPLC may be the superior method, since that's the one

310
00:28:50,920 --> 00:28:58,160
that people have kind of adopted, but you can do pretty incredible things with gas chromatography

311
00:28:58,160 --> 00:28:59,160
too.

312
00:28:59,160 --> 00:29:02,560
Gas chromatography is what people normally test terpenes with.

313
00:29:02,560 --> 00:29:06,080
Okay, so enough of those tangents.

314
00:29:06,080 --> 00:29:13,640
Another cool thing, so right, they've done an internal test, and they've tested and they

315
00:29:13,640 --> 00:29:21,320
found that, okay, we have about a 10% margin of error.

316
00:29:21,320 --> 00:29:27,200
So you can think, and I want to say this is maybe like a Z distribution or something,

317
00:29:27,200 --> 00:29:32,280
so I don't know where, how exactly they're getting these sample size numbers from, but

318
00:29:32,280 --> 00:29:40,900
let's say that these are accurate, then that would mean that PSI Labs has internally tested

319
00:29:40,900 --> 00:29:46,640
their method with a sample size of almost 100, right?

320
00:29:46,640 --> 00:29:51,080
So that means they created an internal standard, right?

321
00:29:51,080 --> 00:29:53,520
So what's an internal standard?

322
00:29:53,520 --> 00:29:59,360
It's basically just maybe say somebody, they're getting a bunch of cannabis samples in, so

323
00:29:59,360 --> 00:30:06,600
what they could do is they could grind up a bunch of flower or mix up a bunch of concentrate

324
00:30:06,600 --> 00:30:10,060
or grind up a bunch of cookies together.

325
00:30:10,060 --> 00:30:18,000
And the idea is you could create sort of a representative sample of what flower would

326
00:30:18,000 --> 00:30:25,640
may be, and then you can just, you basically just test, well actually, that's an internal

327
00:30:25,640 --> 00:30:26,640
standard.

328
00:30:26,640 --> 00:30:31,240
They probably actually need to get third-party standards for this.

329
00:30:31,240 --> 00:30:43,240
So they probably actually have to pay a third party company that says, oh, we sell 99% THC

330
00:30:43,240 --> 00:30:44,240
solution.

331
00:30:44,240 --> 00:30:52,800
Then you'd basically buy the 99% THC solution or what have you, and then you may dilute

332
00:30:52,800 --> 00:30:53,800
it, right?

333
00:30:53,800 --> 00:31:03,080
You may buy a 99% THC solution, you may dilute it to the point where it should have a 50%

334
00:31:03,080 --> 00:31:04,240
concentration.

335
00:31:04,240 --> 00:31:07,500
You test it 100 times.

336
00:31:07,500 --> 00:31:15,640
So that means they, right, they created this reference standard, then they ran 100 samples

337
00:31:15,640 --> 00:31:23,320
on their GC, and then their margin of error had to be within 10%.

338
00:31:23,320 --> 00:31:28,600
So that's actually a pretty wide margin of error when you think about it.

339
00:31:28,600 --> 00:31:33,680
So this has a 50% concentration of THC.

340
00:31:33,680 --> 00:31:38,960
That means it could have, yes, so it could have as low as, don't trust any of the statistics

341
00:31:38,960 --> 00:31:45,600
or math I'm telling you today because as I said, we were off to a super rocky start.

342
00:31:45,600 --> 00:31:55,000
But long story short is it's important to take into consideration your uncertainty.
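
To make the uncertainty concrete, here is a hedged sketch. It assumes the 10% margin of error is relative to the measured value, which may not be how the lab defines it, and it includes the textbook z-based formula for the margin of error of a mean, which is not necessarily the lab's method. The function names are made up for illustration.

```python
import math


def interval_from_relative_margin(measured: float, relative_margin: float):
    """Range implied by a margin of error given as a fraction of the measurement."""
    half_width = measured * relative_margin
    return measured - half_width, measured + half_width


def margin_of_error(std_dev: float, n: int, z: float = 1.96) -> float:
    """Margin of error for a mean of n replicates (z = 1.96 is about 95% confidence)."""
    return z * std_dev / math.sqrt(n)


low, high = interval_from_relative_margin(50.0, 0.10)
print(f"50% THC with a 10% relative margin: {low:.1f}% to {high:.1f}%")
```

Under that assumption, a 50% reading would cover roughly 45% to 55%. If a result's interval includes zero, the measurement is hard to distinguish from nothing being there at all.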

343
00:31:55,000 --> 00:32:03,520
It's useful to know that this sample has a minuscule amount of CBDA detected, but the

344
00:32:03,520 --> 00:32:11,360
uncertainty is so great that statistically I don't think that's necessarily different

345
00:32:11,360 --> 00:32:13,080
from zero.

346
00:32:13,080 --> 00:32:18,080
CBG may be statistically different than zero.

347
00:32:18,080 --> 00:32:25,320
And so, you know, and then just kind of keep that in mind.

348
00:32:25,320 --> 00:32:29,200
Actually the margin of error is quite low on this sample.

349
00:32:29,200 --> 00:32:36,320
But long story short is this is a useful statistic that I do believe most laboratories calculate,

350
00:32:36,320 --> 00:32:45,760
but I think very few report on their COAs.

351
00:32:45,760 --> 00:32:48,160
And I'll get to the juicy part here too.

352
00:32:48,160 --> 00:32:51,960
But basically this is kind of typical, but maybe even on the low end.

353
00:32:51,960 --> 00:33:00,480
So this is a 10% margin of error, which, you know, it's not, you can see higher in the

354
00:33:00,480 --> 00:33:02,480
cannabis industry.

355
00:33:02,480 --> 00:33:08,740
In fact, you know, we were looking at plant patents and it was kind of, it became pretty

356
00:33:08,740 --> 00:33:14,800
apparent that the plant patent we were looking at, they were working with a 20% margin of

357
00:33:14,800 --> 00:33:15,800
error.

358
00:33:15,800 --> 00:33:22,280
And so, you know, cultivators, I think they just may need to get edgy.

359
00:33:22,280 --> 00:33:23,280
Right?

360
00:33:23,280 --> 00:33:25,040
I never fault anyone, right?

361
00:33:25,040 --> 00:33:30,360
So people will get flustered and they'll say, oh, you know, we sent three different samples

362
00:33:30,360 --> 00:33:34,640
to three different laboratories and got back three different results.

363
00:33:34,640 --> 00:33:41,240
One, are those, you know, three different results, are those within the laboratory?

364
00:33:41,240 --> 00:33:48,080
If you sent them to the same laboratory, so let's say you sent this presidential flower

365
00:33:48,080 --> 00:34:00,320
to PSI Labs three times, you know, they may send you back a 22%, a 25%, and then they

366
00:34:00,320 --> 00:34:02,800
send this 23% here.

367
00:34:02,800 --> 00:34:11,080
And so, numbers like that can fluster a cultivator, right, because they're like, oh, you know,

368
00:34:11,080 --> 00:34:12,080
which is it?

369
00:34:12,080 --> 00:34:15,600
Is it 25% or is it 21%?
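
One way to look at the 22%, 25%, and 23% results is to ask whether their error intervals overlap. The sketch below assumes a 10% relative margin of error on each result, which is an assumption for illustration and not a figure reported for these samples. The helper names are hypothetical.

```python
def interval(value, relative_margin):
    """Return the (low, high) interval around a single result."""
    half_width = value * relative_margin
    return value - half_width, value + half_width


def intervals_overlap(values, relative_margin):
    """True when all intervals share at least one common value."""
    bounds = [interval(v, relative_margin) for v in values]
    highest_low = max(low for low, _ in bounds)
    lowest_high = min(high for _, high in bounds)
    return highest_low <= lowest_high


results = [22.0, 25.0, 23.0]  # repeat results for the same flower
consistent = intervals_overlap(results, 0.10)
print(consistent)  # True
```

With that assumed margin, the three intervals share the range from 22.5% to 24.2%, so the results would not contradict one another.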

370
00:34:15,600 --> 00:34:25,440
Those numbers, whether we like it or not, could easily influence the price that that

371
00:34:25,440 --> 00:34:28,080
would set you at retail.

372
00:34:28,080 --> 00:34:38,360
So basically, it's just, it's an information game slash problem where it's just trying

373
00:34:38,360 --> 00:34:40,760
to get the information out the best they can.

374
00:34:40,760 --> 00:34:46,800
And that's why I applaud PSI Labs, is they're doing their part to get that information out

375
00:34:46,800 --> 00:34:47,800
there.

376
00:34:47,800 --> 00:34:51,800
And so I don't think we have it for every compound here, but they're reporting it for

377
00:34:51,800 --> 00:34:55,760
the cannabinoids, which is awesome.

378
00:34:55,760 --> 00:35:02,160
And so then like diligent data scientists will go ahead and collect the margin of error.

379
00:35:02,160 --> 00:35:08,800
And like I said, I don't necessarily know how I'm going to use it right away, but we

380
00:35:08,800 --> 00:35:14,520
can step one, right, step one, get the data.

381
00:35:14,520 --> 00:35:19,120
So we'll just get the data.

382
00:35:19,120 --> 00:35:26,400
So what's this even going to look like?

383
00:35:26,400 --> 00:35:32,320
You know, I'm getting their QR code, which I may not even get.

384
00:35:32,320 --> 00:35:41,560
So basically, I was kind of thinking about this and, right, it's always good to be transparent

385
00:35:41,560 --> 00:35:43,800
about all the potential problems.

386
00:35:43,800 --> 00:35:51,320
I was speaking with the laboratory in Las Vegas and they did not like the idea.

387
00:35:51,320 --> 00:35:52,320
Actually I take that back.

388
00:35:52,320 --> 00:35:57,760
Maybe it was a, actually, I think it was a cultivator in Oregon and they did not like

389
00:35:57,760 --> 00:36:07,680
the idea of having their COAs out there because they said that people would copy their COAs

390
00:36:07,680 --> 00:36:11,000
and you know, fraudulently use them.

391
00:36:11,000 --> 00:36:15,680
And that's no good.

392
00:36:15,680 --> 00:36:19,920
And so some labs are sort of implementing things like QR codes.

393
00:36:19,920 --> 00:36:22,120
That way you scan the QR code.

394
00:36:22,120 --> 00:36:25,080
It goes to PSI Labs' website.

395
00:36:25,080 --> 00:36:27,320
You can kind of confirm.

396
00:36:27,320 --> 00:36:32,960
So I don't necessarily know if I should use this QR code or not.

397
00:36:32,960 --> 00:36:41,320
I was actually going to say that the QR code may be sort of a 20th century technology.

398
00:36:41,320 --> 00:36:51,080
And the 21st century technology equivalent of a QR code would be, Candace?

399
00:36:51,080 --> 00:36:52,080
Drum roll.

400
00:36:52,080 --> 00:36:57,040
It's actually going to be essentially a data NFT.

401
00:36:57,040 --> 00:37:00,320
That's correct with Ocean, yes.

402
00:37:00,320 --> 00:37:06,640
And so that's why I was sort of introducing these hashes because I mean,

403
00:37:06,640 --> 00:37:07,640
Ah, okay.

404
00:37:07,640 --> 00:37:09,880
I get the logic for sure.

405
00:37:09,880 --> 00:37:15,240
Well I'll kind of get, we can kind of maybe build this more up next week.

406
00:37:15,240 --> 00:37:23,720
But the idea is, right, you can take any random, not random, but you can take any message and

407
00:37:23,720 --> 00:37:30,440
encode it with a private key and you'll get back this hash.

408
00:37:30,440 --> 00:37:37,920
And so the idea, I do believe, with the data NFTs is, right, and I may be mistaking this

409
00:37:37,920 --> 00:37:48,280
entirely, but I do believe if you have a certain set of data and you hash it, you'll get a

410
00:37:48,280 --> 00:37:50,520
unique hash, right?

411
00:37:50,520 --> 00:37:54,240
And it'll just, that's your hash, right?

412
00:37:54,240 --> 00:37:58,080
That data will always give you that hash.

413
00:37:58,080 --> 00:38:02,360
If the data changes, you'll have a different hash.
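
The property described here can be checked with Python's standard `hashlib` module. This is a minimal sketch, and the sample strings are made up. Note that a plain SHA-256 hash takes no key at all; keys only come into play with keyed constructions such as HMACs or with digital signatures.

```python
import hashlib


def sha256_hex(text):
    """Return the SHA-256 digest of a string as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


original = "total_thc: 23.0"  # made-up lab result
edited = "total_thc: 25.0"    # same result with one digit changed

print(sha256_hex(original) == sha256_hex(original))  # True
print(sha256_hex(original) == sha256_hex(edited))    # False
print(len(sha256_hex(original)))                     # 64
```

The same input always yields the same digest, and even a one-character change yields a different one.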

414
00:38:02,360 --> 00:38:12,120
And so that's sort of the idea behind traceability is, right, once the laboratory reports the

415
00:38:12,120 --> 00:38:16,520
numbers, right, it's got a hash.

416
00:38:16,520 --> 00:38:21,440
And the idea is the hash is sort of on the blockchain, so a bunch of people will agree

417
00:38:21,440 --> 00:38:27,720
on the hash and you can't just go back in time and change the hash willy-nilly.

418
00:38:27,720 --> 00:38:33,040
You know, you can change it for the future and then everything will be recorded, but

419
00:38:33,040 --> 00:38:37,000
you know, this original hash will kind of exist.

420
00:38:37,000 --> 00:38:40,360
Right, like a ledger.

421
00:38:40,360 --> 00:38:41,360
Exactly.

422
00:38:41,360 --> 00:38:49,840
So basically the idea is, you know, if a laboratory reports this set of results, they'll get hashed

423
00:38:49,840 --> 00:38:53,800
and that will sort of just be this unique hash.

424
00:38:53,800 --> 00:39:01,960
So that way, you know, everybody can basically agree that yes, you know, those were the results.
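
A rough sketch of how anyone could check a set of results against a published hash. It assumes the results are serialized to canonical JSON before hashing; the field names and the `fingerprint` helper are hypothetical, and this is not how any particular lab or blockchain does it.

```python
import hashlib
import json


def fingerprint(results):
    """Hash a dict of results using a stable (sorted, compact) JSON form."""
    canonical = json.dumps(results, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_published(results, published_hash):
    """True when the results hash to the value that was published."""
    return fingerprint(results) == published_hash


reported = {"sample_id": "example-001", "total_thc": 23.0, "cbg": 0.4}
published_hash = fingerprint(reported)  # what the lab would publish

tampered = dict(reported, total_thc=28.0)  # a copied, altered COA

print(matches_published(reported, published_hash))  # True
print(matches_published(tampered, published_hash))  # False
```

Sorting the keys matters: without a canonical form, the same results written in a different order would produce a different hash.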

425
00:39:01,960 --> 00:39:10,840
And like I said, I am not the smartest cookie in the jar, so don't try to go to me for an

426
00:39:10,840 --> 00:39:14,000
explanation about how the blockchain works.

427
00:39:14,000 --> 00:39:22,920
But basically that's my monkey brain understanding of what's going on is somehow the data is

428
00:39:22,920 --> 00:39:33,160
getting hashed, the hash is unique, everyone agrees about the data going in on it, and

429
00:39:33,160 --> 00:39:40,400
you know, as far as the data NFTs, there may be essentially some sort of ownership involved.

430
00:39:40,400 --> 00:39:46,720
But like I said, I'm going to read up more on this, on the statistics and data, so that's

431
00:39:46,720 --> 00:39:54,240
what I'll teach you and I'll try to learn more about some of these other cool technologies.

432
00:39:54,240 --> 00:39:56,440
No, it's really cool.

433
00:39:56,440 --> 00:40:04,480
And then with Ocean, with their training, they actually have a Rinkeby test network,

434
00:40:04,480 --> 00:40:05,840
so you can play around with it.

435
00:40:05,840 --> 00:40:11,640
You can actually send messages, transactions back and forth, get the different hashes and

436
00:40:11,640 --> 00:40:16,960
kind of get a real feel for what's going on without having to hook in your bank account

437
00:40:16,960 --> 00:40:17,960
or credit card, right?

438
00:40:17,960 --> 00:40:18,960
It's all free.

439
00:40:18,960 --> 00:40:21,080
They can actually give you a free coinage too.

440
00:40:21,080 --> 00:40:23,400
It's pretty interesting, their training.

441
00:40:23,400 --> 00:40:26,360
And I think we may just be sort of on the cutting edge of it, right?

442
00:40:26,360 --> 00:40:33,960
It's basically like, remember the secure hashing algorithms came about in 2002.

443
00:40:33,960 --> 00:40:41,920
Bitcoin came about in like 2008 or 2009 or so.

444
00:40:41,920 --> 00:40:50,240
So it's not like, and these days, if you're talking to a software developer, they're going

445
00:40:50,240 --> 00:40:55,800
to expect you to implement some sort of hashing algorithm.

446
00:40:55,800 --> 00:40:57,600
If you're doing like authentic, right?

447
00:40:57,600 --> 00:41:02,640
So like I was working with a developer and we were doing authentication for an API.

448
00:41:02,640 --> 00:41:10,040
It's like, yeah, you need to implement some sort of hashing algorithm, at least the SHA-256,

449
00:41:10,040 --> 00:41:11,040
right?

450
00:41:11,040 --> 00:41:18,080
There's more complicated ones that exist today, but this is sort of an industry standard.

451
00:41:18,080 --> 00:41:27,080
So it's basically like, what was once a new technology will eventually become the standard.

452
00:41:27,080 --> 00:41:36,280
So I think QR codes are useful, but I think especially for laboratory testing, I think

453
00:41:36,280 --> 00:41:42,040
things like data NFTs, essentially smart contracts, I think those are just going to end up being

454
00:41:42,040 --> 00:41:46,240
standard practice with laboratory testing.

455
00:41:46,240 --> 00:41:52,280
So that's what I'm trying to, and not like overnight, it may be five or 10 years from

456
00:41:52,280 --> 00:41:56,280
now, but that's what my hypothesis is.

457
00:41:56,280 --> 00:41:58,760
So that's why I'm trying to learn about them.

458
00:41:58,760 --> 00:42:04,760
Okay, so now we're kind of at the end.

459
00:42:04,760 --> 00:42:11,760
I'll have to think about some way to make it up to you all for the rocky start today.

460
00:42:11,760 --> 00:42:17,400
But basically, I wrote all of this code into these reusable functions.

461
00:42:17,400 --> 00:42:27,400
So now you can basically go through and I shared this script on Slack, but I'll make

462
00:42:27,400 --> 00:42:30,360
sure to email it out to everybody as well.

463
00:42:30,360 --> 00:42:34,560
And then I'll also post it all to GitHub.

464
00:42:34,560 --> 00:42:36,600
And I will email you the data.

465
00:42:36,600 --> 00:42:48,320
And so the idea is you can now run this script on all the pages and collect the data and

466
00:42:48,320 --> 00:42:53,600
we can look at the data.

467
00:42:53,600 --> 00:42:59,560
And keep in mind, I will deliver you the near 50,000 results that were promised.

468
00:42:59,560 --> 00:43:07,680
I was only able to run this for two or three hours yesterday afternoon.

469
00:43:07,680 --> 00:43:16,400
So in those three hours or so, I collected 2,000 lab results for you.

470
00:43:16,400 --> 00:43:18,840
And then I'll go ahead and immediately share.

471
00:43:18,840 --> 00:43:21,600
These are their most recent 2,000.

472
00:43:21,600 --> 00:43:23,560
And I know I promised you 50,000.

473
00:43:23,560 --> 00:43:26,560
So I will deliver.

474
00:43:26,560 --> 00:43:29,040
If you will be so kind to wait.

475
00:43:29,040 --> 00:43:32,080
So I'll go ahead and start collecting the rest.

476
00:43:32,080 --> 00:43:41,200
And as I said, it takes about an hour and a half to collect 1,000 samples.

477
00:43:41,200 --> 00:43:50,840
So it could literally take me two or three days of computing power to collect all 50,000.
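
The two-or-three-day estimate can be checked from the figures given earlier: about 2,000 results in about three hours. A minimal sketch, assuming the collection rate stays constant (the function name is made up):

```python
def hours_to_collect(total, collected, hours_spent):
    """Extrapolate total run time from an observed collection rate."""
    return total * hours_spent / collected


hours = hours_to_collect(50_000, 2_000, 3)
days = hours / 24
print(round(hours), round(days, 1))  # 75 3.1
```

At that rate the full 50,000 would take roughly 75 hours, a little over three days of continuous running.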

478
00:43:50,840 --> 00:43:54,840
So I'm sorry if I was a little misleading.

479
00:43:54,840 --> 00:43:56,960
I hope I didn't mislead you.

480
00:43:56,960 --> 00:44:03,680
But everyone who signed up will get all of this data in full.

481
00:44:03,680 --> 00:44:09,400
And better than that is this is the unprocessed data.

482
00:44:09,400 --> 00:44:14,560
And so basically, and I'll go ahead and wrap up here since we don't need to beat you to

483
00:44:14,560 --> 00:44:15,560
death with this.

484
00:44:15,560 --> 00:44:23,720
But basically the idea is, okay, now that we have it all collected, we can go through,

485
00:44:23,720 --> 00:44:25,520
clean it, right?

486
00:44:25,520 --> 00:44:31,520
We can remove the delta symbols, right?

487
00:44:31,520 --> 00:44:34,560
We can parse out the units.

488
00:44:34,560 --> 00:44:40,120
We can handle these positive, negative signs and margin of error.

489
00:44:40,120 --> 00:44:45,920
We can do all that fine, dandy cleaning stuff after we've collected it.
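
A sketch of what that cleaning step might look like in Python. The raw formats here ("Δ9-THC" for analyte names, "23.05 ± 2.31 %" for values) are assumptions about the scraped text, not confirmed from the data, and the function names are hypothetical.

```python
import re

VALUE_PATTERN = re.compile(
    r"^\s*(?P<value>\d+(?:\.\d+)?)"           # measured value
    r"(?:\s*±\s*(?P<margin>\d+(?:\.\d+)?))?"  # optional margin of error
    r"\s*(?P<units>\S+)?\s*$"                 # optional units
)


def clean_analyte(name):
    """Replace the delta symbol and normalize the name to snake case."""
    return name.strip().replace("Δ", "delta_").replace("-", "_").lower()


def parse_result(raw):
    """Split a raw result string into value, margin of error, and units."""
    match = VALUE_PATTERN.match(raw)
    if match is None:
        return None  # e.g. "ND" (not detected) or other non-numeric text
    margin = match.group("margin")
    return {
        "value": float(match.group("value")),
        "margin_of_error": float(margin) if margin is not None else None,
        "units": match.group("units"),
    }


print(clean_analyte("Δ9-THC"))         # delta_9_thc
print(parse_result("23.05 ± 2.31 %"))
print(parse_result("ND"))              # None
```

Keeping the raw strings and parsing afterwards means the parser can be fixed and rerun without scraping again.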

490
00:44:45,920 --> 00:44:47,400
So I'll do that for you.

491
00:44:47,400 --> 00:44:53,000
And then basically next week, I was going to do this today, but it was just too jam-packed

492
00:44:53,000 --> 00:44:58,920
of a day and we had such a – thanks to me, we had a rough start.

493
00:44:58,920 --> 00:45:03,800
But for next week – and we can have a bit more of a conversation about the data next

494
00:45:03,800 --> 00:45:04,800
week too.

495
00:45:04,800 --> 00:45:08,480
But next week we can finally start to explore this.

496
00:45:08,480 --> 00:45:20,520
And as I said, the main question – and I actually saw some cultivators and a business

497
00:45:20,520 --> 00:45:26,280
magazine talking about this, and so I think this is what people are interested in is,

498
00:45:26,280 --> 00:45:38,280
are cultivators able to essentially get better and better at breeding for these compounds?

499
00:45:38,280 --> 00:45:42,000
And so I kind of wanted to do a time series analysis, right?

500
00:45:42,000 --> 00:45:48,480
We've got results going back to 2016 all the way to 2021.

501
00:45:48,480 --> 00:45:56,400
So in this five or six-year time span, have breeders been able to breed for statistically

502
00:45:56,400 --> 00:46:09,640
higher levels of D-limonene, beta-pinene, myrcene, caryophyllene, linalool, THC, CBD?

503
00:46:09,640 --> 00:46:14,480
Or have processors been able to get more efficient?

504
00:46:14,480 --> 00:46:18,420
Are concentrates increasing over time?
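
One simple first pass at these questions is to fit a straight line to the yearly averages and look at the slope. The sketch below uses placeholder numbers, not real lab results, and the slope alone does not establish statistical significance; a proper test would also need the spread within each year.

```python
def ols_slope(xs, ys):
    """Ordinary least squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


years = [2016, 2017, 2018, 2019, 2020, 2021]
# Placeholder yearly mean THC percentages, invented for illustration only.
mean_thc = [18.0, 18.5, 19.5, 20.0, 21.0, 21.5]

slope = ols_slope(years, mean_thc)
print(f"Change per year: {slope:.2f} percentage points")
```

A positive slope would be consistent with levels rising over the period; the same function could be run per compound or per product type.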

505
00:46:18,420 --> 00:46:24,940
So I think these are questions that people are interested in that I think the Cannabis

506
00:46:24,940 --> 00:46:28,960
Data Science team is uniquely positioned to answer.

507
00:46:28,960 --> 00:46:34,120
So first step, first we'll just get the data.

508
00:46:34,120 --> 00:46:37,440
Then we'll clean it up and analyze it.

509
00:46:37,440 --> 00:46:38,440
So my apologies.

510
00:46:38,440 --> 00:46:42,520
It's a complete mess of a presentation today.

511
00:46:42,520 --> 00:46:49,200
So hopefully once I get this data in your hands, that may make up for things.

512
00:46:49,200 --> 00:46:55,240
But does anybody have any questions, comments, thoughts before we call it a day?

513
00:46:55,240 --> 00:46:56,240
All right.

514
00:46:56,240 --> 00:46:57,240
Too cool.

515
00:46:57,240 --> 00:46:58,240
Until next time, have an awesome day.

516
00:46:58,240 --> 00:47:13,880
Keep advancing Cannabis Science.

