1
00:00:00,000 --> 00:00:08,360
Welcome to the Cannabis Data Science Meetup group.

2
00:00:08,360 --> 00:00:14,200
Fun group of cannabis enthusiasts, data scientists, anyone who's interested in wrangling data

3
00:00:14,200 --> 00:00:15,880
into the cannabis space.

4
00:00:15,880 --> 00:00:21,880
So I see a bunch of friendly faces, and then Prince introduced herself.

5
00:00:21,880 --> 00:00:26,680
So good to see you Grant, Hector, Tammy, Candice.

6
00:00:26,680 --> 00:00:31,280
Just to give everybody a chance to speak if you want.

7
00:00:31,280 --> 00:00:35,920
Before Candice and I spill the beans about some of the nifty things that we've been cooking

8
00:00:35,920 --> 00:00:37,920
up.

9
00:00:37,920 --> 00:00:38,920
Does anyone else want to speak?

10
00:00:38,920 --> 00:00:44,960
I'll just go around and just see if anyone has any topics that you're curious in tackling

11
00:00:44,960 --> 00:00:50,760
in the coming months because we're actually ready to employ our talents onto some new

12
00:00:50,760 --> 00:00:51,760
projects.

13
00:00:51,760 --> 00:00:59,480
So I'll maybe hint at some of those because, as we'll demonstrate today, one of these

14
00:00:59,480 --> 00:01:03,240
tools that we're cooking up is coming along nicely.

15
00:01:03,240 --> 00:01:08,360
So before we get to this Hector, anything on your plate or agenda?

16
00:01:08,360 --> 00:01:18,100
No, I'm still looking for work and I'm doing some projects on the side with data analysis

17
00:01:18,100 --> 00:01:21,160
just to keep my skills fresh.

18
00:01:21,160 --> 00:01:25,120
I'm excited to see what you have for us today.

19
00:01:25,120 --> 00:01:27,560
Too cool.

20
00:01:27,560 --> 00:01:36,040
And one of the highest in demand things is so we've done natural language processing

21
00:01:36,040 --> 00:01:42,320
and one thing that we may attempt to do, I don't think, yeah we haven't done it yet,

22
00:01:42,320 --> 00:01:45,560
is natural language generation.

23
00:01:45,560 --> 00:01:52,560
So that takes it one step further so that's almost given a prompt, create text, and there's

24
00:01:52,560 --> 00:01:56,240
actually a nifty application for this.

25
00:01:56,240 --> 00:01:58,640
So technically that's data science.

26
00:01:58,640 --> 00:02:06,240
So yeah, so we can actually maybe move on to this before too, too long.

27
00:02:06,240 --> 00:02:07,240
So that's another teaser.

28
00:02:07,240 --> 00:02:11,960
So but Tammy, it's going to be fine.

29
00:02:11,960 --> 00:02:19,840
So Tammy, anything on your plate that you're interested in working on?

30
00:02:19,840 --> 00:02:28,880
Well it's been a while since I've attended but since then I have graduated from my data

31
00:02:28,880 --> 00:02:35,000
analytics boot camp and am actively looking for a job.

32
00:02:35,000 --> 00:02:44,280
So I would like to, there's my camera, now I'm back in my office, I'd like to work on

33
00:02:44,280 --> 00:02:50,640
some things and maybe improve my portfolio.

34
00:02:50,640 --> 00:02:53,560
Too cool Tammy.

35
00:02:53,560 --> 00:02:59,960
Remind me if you don't mind what state you may be in just so I kind of have a, just knowledge,

36
00:02:59,960 --> 00:03:03,120
you don't have to share, I guess that's kind of personal information, but if you want to

37
00:03:03,120 --> 00:03:10,160
do that way I can kind of just understand the cannabis regulations in your space.

38
00:03:10,160 --> 00:03:13,160
Well I am in the mountains of Virginia.

39
00:03:13,160 --> 00:03:14,160
Virginia.

40
00:03:14,160 --> 00:03:23,160
Yep, yep, they're not as big of mountains as what's behind Hector.

41
00:03:23,160 --> 00:03:27,040
So this is, it'll be interesting to see how Virginia pans out.

42
00:03:27,040 --> 00:03:35,960
So just to be frank, there were, it seems like there were maybe some people that saw

43
00:03:35,960 --> 00:03:39,040
a lot of the ball coming before others.

44
00:03:39,040 --> 00:03:40,040
Right.

45
00:03:40,040 --> 00:03:45,080
Right, you know, you're in the right channels, you may have just kind of heard about this.

46
00:03:45,080 --> 00:03:51,720
So you know, some people have been able to kind of get a good running at this, in particular

47
00:03:51,720 --> 00:03:54,520
hemp companies and there's no problem there.

48
00:03:54,520 --> 00:04:00,680
In fact, I've always been going on about, oh, hey, companies, you know, you may want

49
00:04:00,680 --> 00:04:05,200
to think about positioning yourself to enter, right.

50
00:04:05,200 --> 00:04:10,800
So it worked out well for the hemp growers in New York, big time.

51
00:04:10,800 --> 00:04:16,360
Well I know there was North Carolina is where I'm from.

52
00:04:16,360 --> 00:04:24,920
And the farmers all around, like right beside my in-laws house, they were growing hemp and

53
00:04:24,920 --> 00:04:32,040
they stopped after one year because of, there were vehicles running through their yard and

54
00:04:32,040 --> 00:04:35,040
through their fence to get to the field.

55
00:04:35,040 --> 00:04:37,040
It's crazy.

56
00:04:37,040 --> 00:04:44,400
Actually, that's an interesting complication that you raise is, you know, you've got to

57
00:04:44,400 --> 00:04:49,320
have security dialed in and it's maybe that's something that maybe people don't take into

58
00:04:49,320 --> 00:04:51,360
consideration with the outdoor, right.

59
00:04:51,360 --> 00:04:55,920
Because if you've got your indoor facility, you can put locks on your doors, some would

60
00:04:55,920 --> 00:04:59,000
have cameras, but you're right.

61
00:04:59,000 --> 00:05:05,040
I don't know if you could just put a hemp field right beside a highway and you're right,

62
00:05:05,040 --> 00:05:08,080
people may be romping through.

63
00:05:08,080 --> 00:05:16,480
But anywho, like I said, it'll be interesting to see how things pan out.

64
00:05:16,480 --> 00:05:19,720
I'm optimistic.

65
00:05:19,720 --> 00:05:27,040
And then in Virginia, you're actually not that far from Maryland as well as the Northeast.

66
00:05:27,040 --> 00:05:33,780
So it's kind of getting up there, but you know, New Jersey, there's a lot of action

67
00:05:33,780 --> 00:05:34,780
going on there.

68
00:05:34,780 --> 00:05:39,560
And, you know, with data analytics, you know, you can do a lot online.

69
00:05:39,560 --> 00:05:41,160
So it wouldn't take.

70
00:05:41,160 --> 00:05:48,080
So long story short, as we'll show you some opportunities today that primarily have to

71
00:05:48,080 --> 00:05:50,000
do with lab data.

72
00:05:50,000 --> 00:05:54,720
And you may want to think about reaching out to some of the laboratories in Maryland.

73
00:05:54,720 --> 00:06:01,280
I know a laboratory in New Jersey, but you may want to get in contact with more there.

74
00:06:01,280 --> 00:06:07,840
New York, I don't know who's testing in New York, but probably somebody is.

75
00:06:07,840 --> 00:06:12,080
So I mean, so that's a start.

76
00:06:12,080 --> 00:06:13,680
So that's what I started to do.

77
00:06:13,680 --> 00:06:17,200
The laboratories have a lot on their plate, though.

78
00:06:17,200 --> 00:06:23,220
So there's a lot of people that need help with laboratory data.

79
00:06:23,220 --> 00:06:27,320
So in fact, the retailers, the retailers are probably going to be the easiest to get in

80
00:06:27,320 --> 00:06:32,080
contact with, but they're probably going to have the, you know, the highest guards up,

81
00:06:32,080 --> 00:06:33,080
right?

82
00:06:33,080 --> 00:06:34,080
They don't.

83
00:06:34,080 --> 00:06:35,080
Right.

84
00:06:35,080 --> 00:06:36,080
Right.

85
00:06:36,080 --> 00:06:37,080
But it doesn't hurt.

86
00:06:37,080 --> 00:06:41,640
So my recommendation is to find out if they have the contact information

87
00:06:41,640 --> 00:06:44,560
on their website.

88
00:06:44,560 --> 00:06:49,520
You may want to reach out to them that way because someone's checking the inbox, right?

89
00:06:49,520 --> 00:06:55,560
And so long story short is we're going to be doing some certificate of analysis parsing

90
00:06:55,560 --> 00:07:01,480
and it's all hands on deck and we're doing some in California.

91
00:07:01,480 --> 00:07:07,480
I was going to mention, going to see what we can't do in Washington state and Prince,

92
00:07:07,480 --> 00:07:11,120
Colorado, so we can be in touch there.

93
00:07:11,120 --> 00:07:15,000
And then like I said, the whole Eastern seaboard, right?

94
00:07:15,000 --> 00:07:16,320
You know, it's all hands on deck.

95
00:07:16,320 --> 00:07:22,960
So there's probably people over there that need certificate of analysis parsing as well.

96
00:07:22,960 --> 00:07:23,960
Right.

97
00:07:23,960 --> 00:07:31,840
So anywho, Prince, I know you're listening in, so feel free to chime in at any point.

98
00:07:31,840 --> 00:07:36,600
And then Grant, is there anything on your plate that you want answered?

99
00:07:36,600 --> 00:07:42,440
Any data science topics on the brain?

100
00:07:42,440 --> 00:07:43,440
Not particularly.

101
00:07:43,440 --> 00:07:47,440
I've actually, I've just been struggling to actually find the time to join this meeting.

102
00:07:47,440 --> 00:07:52,800
I joined sort of right before, you know, you had set up the new time and everything.

103
00:07:52,800 --> 00:07:56,480
And yeah, I've just always had work meetings overlapping except for today.

104
00:07:56,480 --> 00:08:03,040
So yeah, I'm just stoked to just learn more, honestly.

105
00:08:03,040 --> 00:08:06,200
Happy to have you Grant and happy the stars aligned.

106
00:08:06,200 --> 00:08:13,400
In fact, this is not really the lesson of the day, but it's something that's kind of

107
00:08:13,400 --> 00:08:15,880
a recurring theme, stars aligning.

108
00:08:15,880 --> 00:08:21,200
But anywho, Candice, anything on your mind?

109
00:08:21,200 --> 00:08:24,120
I know we've been doing a lot of certificate parsing.

110
00:08:24,120 --> 00:08:31,520
So anything jumping out at you that you'd like to share or thoughts?

111
00:08:31,520 --> 00:08:32,520
Thank you, Keegan.

112
00:08:32,520 --> 00:08:33,520
Let's see.

113
00:08:33,520 --> 00:08:42,480
Yeah, we've been spending a lot of time web scraping URLs and parsing PDFs.

114
00:08:42,480 --> 00:08:43,920
And it's been pretty cool.

115
00:08:43,920 --> 00:08:46,320
We sure do have a lot of data.

116
00:08:46,320 --> 00:08:54,320
And so what I have not completed is I want to do some plots on terpene ratios.

117
00:08:54,320 --> 00:08:58,440
And but I haven't completed that yet.

118
00:08:58,440 --> 00:09:00,960
I just started that today.

119
00:09:00,960 --> 00:09:02,880
But we sure do have a lot of data.

120
00:09:02,880 --> 00:09:05,960
So we'll look at some terpene ratios today.

121
00:09:05,960 --> 00:09:11,280
And one thing that's fun about them is it's as we were talking about with John is it's

122
00:09:11,280 --> 00:09:15,440
sort of a good sort of like I like to call these sanity checks.

123
00:09:15,440 --> 00:09:16,440
Right.

124
00:09:16,440 --> 00:09:22,240
So you do so much work and it gets a little abstract that sometimes you just kind of need

125
00:09:22,240 --> 00:09:25,480
to put things down on paper and do a sanity check.

126
00:09:25,480 --> 00:09:28,320
And so it's like, you know, we've done so much data wrangling.

127
00:09:28,320 --> 00:09:34,320
Like, it's just time to visualize it a little bit just as a sanity check, just see like,

128
00:09:34,320 --> 00:09:37,000
okay, you know, are these percentages?

129
00:09:37,000 --> 00:09:40,840
Are they between zero and 100?

130
00:09:40,840 --> 00:09:42,480
What does the distribution look like?
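
A minimal sketch of the sanity check described here, assuming the parsed results are held as a plain list of percentages. The function name `sanity_check` and the sample values are hypothetical, not taken from the meetup code.

```python
from statistics import mean, median


def sanity_check(values, low=0.0, high=100.0):
    """Summarize a list of percentages and flag values outside [low, high]."""
    out_of_range = [v for v in values if not (low <= v <= high)]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "out_of_range": out_of_range,
    }


# Hypothetical terpene percentages; 120.0 mimics a parsing mistake.
terpene_percent = [0.12, 0.30, 0.05, 120.0, 0.22]
report = sanity_check(terpene_percent)
print(report["out_of_range"])
```

The summary statistics give a rough picture of the distribution; a histogram would be the natural next step once the out-of-range values are dealt with.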

131
00:09:42,480 --> 00:09:44,200
So we'll do some of those things today.

132
00:09:44,200 --> 00:09:47,360
It's just, you know, just important.

133
00:09:47,360 --> 00:09:53,280
It's all like, you know, as a good data scientist, rule number one, you know, look at the data.

134
00:09:53,280 --> 00:09:54,280
It sounds great.

135
00:09:54,280 --> 00:09:57,320
Anywho, George, welcome to the group today.

136
00:09:57,320 --> 00:10:04,240
We've got John, anything on your mind today, George, about data science with the merger

137
00:10:04,240 --> 00:10:07,920
of the cannabis industry and anything you'd like answered?

138
00:10:07,920 --> 00:10:08,920
I have no idea.

139
00:10:08,920 --> 00:10:09,920
I'm here to learn.

140
00:10:09,920 --> 00:10:16,160
I'm a student, I'm a master's of data science student at University of Colorado Boulder.

141
00:10:16,160 --> 00:10:17,160
Too cool.

142
00:10:17,160 --> 00:10:19,840
Well, I'm actually here for an assignment.

143
00:10:19,840 --> 00:10:24,480
I have to attend a data science meetup and do a writeup about it.

144
00:10:24,480 --> 00:10:30,280
And this one seemed interesting, seemed to be one of the more interesting ones.

145
00:10:30,280 --> 00:10:33,240
Well, I love it.

146
00:10:33,240 --> 00:10:37,360
So is there anything to get done, you know, the Cannlytics name out there?

147
00:10:37,360 --> 00:10:42,400
And so, you know, if the University of Colorado points you in our direction, then that's superb.

148
00:10:42,400 --> 00:10:46,480
And in fact, I think you'll walk away with something out of it.

149
00:10:46,480 --> 00:10:48,280
Hopefully lots of cool things going on.

150
00:10:48,280 --> 00:10:50,360
So John, good to see you.

151
00:10:50,360 --> 00:10:56,880
We're just about to kick off with some of the showcasing of the new tools that we've

152
00:10:56,880 --> 00:10:57,880
built.

153
00:10:57,880 --> 00:11:03,720
So before we really dive into that, anything on your mind, cannabis data science related

154
00:11:03,720 --> 00:11:06,480
that you want to talk about?

155
00:11:06,480 --> 00:11:13,600
Check the mic.

156
00:11:13,600 --> 00:11:16,800
You may have your own.

157
00:11:16,800 --> 00:11:24,120
We can't hear you, John.

158
00:11:24,120 --> 00:11:25,120
We can hear you now.

159
00:11:25,120 --> 00:11:26,120
Yeah, sorry.

160
00:11:26,120 --> 00:11:28,640
It helps to unmute.

161
00:11:28,640 --> 00:11:37,280
I'm real interested in what Boulder has in terms of data science curriculum and how you

162
00:11:37,280 --> 00:11:40,840
would be steered to something like this, George.

163
00:11:40,840 --> 00:11:41,840
That's very cool.

164
00:11:41,840 --> 00:11:42,840
Right.

165
00:11:42,840 --> 00:11:49,200
They have this new master's program specifically for data science.

166
00:11:49,200 --> 00:11:51,640
I think it's new as of last year.

167
00:11:51,640 --> 00:11:55,080
It's about a year old now.

168
00:11:55,080 --> 00:11:58,560
And what credentials do you need to get into it?

169
00:11:58,560 --> 00:12:06,440
I mean, what kind of undergraduate degree are they needing to get so that you can get

170
00:12:06,440 --> 00:12:07,960
credential?

171
00:12:07,960 --> 00:12:09,920
It's pretty interdisciplinary.

172
00:12:09,920 --> 00:12:16,680
This program is actually they have performance based admissions.

173
00:12:16,680 --> 00:12:19,320
So you take a few classes unenrolled.

174
00:12:19,320 --> 00:12:22,000
But my bachelor's is in computer science.

175
00:12:22,000 --> 00:12:28,360
I think a bachelor's in any kind of math, applied math, or statistics would be good;

176
00:12:28,360 --> 00:12:29,440
that would work.

177
00:12:29,440 --> 00:12:33,200
And I think there are a lot of interdisciplinary people that are coming from fields like

178
00:12:33,200 --> 00:12:34,200
biology.

179
00:12:34,200 --> 00:12:35,200
Excellent.

180
00:12:35,200 --> 00:12:41,440
Well, I'd be happy to learn more.

181
00:12:41,440 --> 00:12:44,600
I would just say you've got a good background, George.

182
00:12:44,600 --> 00:12:49,880
Computer science is one of the best backgrounds you could have because you'll need to touch

183
00:12:49,880 --> 00:12:50,880
up.

184
00:12:50,880 --> 00:12:51,880
You'll need to, of course, learn some statistics.

185
00:12:51,880 --> 00:12:52,880
Right.

186
00:12:52,880 --> 00:12:56,440
A lot of the statistics is under the hood, but I always emphasize it's important to,

187
00:12:56,440 --> 00:13:01,520
you know, understand what's going on under the hood in case something breaks.

188
00:13:01,520 --> 00:13:09,040
But for the most part, once you understand the statistics, then you can use it.

189
00:13:09,040 --> 00:13:11,680
Then that's the easy part.

190
00:13:11,680 --> 00:13:21,120
The hard part is getting your model or your statistical model, your algorithm into production.

191
00:13:21,120 --> 00:13:24,760
And so that's wide open.

192
00:13:24,760 --> 00:13:27,960
So that could be, you know, are you going to be running it in the cloud?

193
00:13:27,960 --> 00:13:30,240
Are you going to have your own server?

194
00:13:30,240 --> 00:13:33,700
I mean, you know all that jazz.

195
00:13:33,700 --> 00:13:35,680
So that's sort of the fun part, right.

196
00:13:35,680 --> 00:13:42,120
So since you already have all those skills in your tool belt, as you start learning,

197
00:13:42,120 --> 00:13:44,680
you know, you may want to.

198
00:13:44,680 --> 00:13:48,640
In fact, I would recommend while you're at the University of Colorado, you may want to

199
00:13:48,640 --> 00:13:52,000
go get acquainted with some of the statisticians there.

200
00:13:52,000 --> 00:13:53,000
Right.

201
00:13:53,000 --> 00:13:58,720
There's probably some brilliant statisticians there that you could learn a lot from in just

202
00:13:58,720 --> 00:14:05,000
30 minutes or more, you know, sit in on some of their lectures.

203
00:14:05,000 --> 00:14:10,240
Because if you can basically stack these skills, so you got your computer science, you can

204
00:14:10,240 --> 00:14:17,440
ship your model and you know the statistics, so you're not making rookie mistakes and you

205
00:14:17,440 --> 00:14:26,360
actually know what you're doing, then you're just going to, you know, what someone in UC

206
00:14:26,360 --> 00:14:29,920
Irvine told me, you're going to be selling like hotcakes, right.

207
00:14:29,920 --> 00:14:33,920
Because you're actually going to be able to build and implement statistics that people

208
00:14:33,920 --> 00:14:34,920
need.

209
00:14:34,920 --> 00:14:37,360
So, too cool.

210
00:14:37,360 --> 00:14:40,520
But enough of hammering that home.

211
00:14:40,520 --> 00:14:44,920
But let's just go ahead and kick off today, right.

212
00:14:44,920 --> 00:14:55,280
And show you, you know, what you can actually do, you know, with statistics.

213
00:14:55,280 --> 00:14:56,800
And computer science too, right.

214
00:14:56,800 --> 00:15:01,200
Today's actually going to be a computer science heavy day.

215
00:15:01,200 --> 00:15:09,680
So just to go back, so I always like to look at the history of some of these technologies,

216
00:15:09,680 --> 00:15:16,640
because you know, John likes to remind me and I always end up discovering a lot of these

217
00:15:16,640 --> 00:15:19,000
technologies aren't new.

218
00:15:19,000 --> 00:15:24,520
They're just better implementations of old technology.

219
00:15:24,520 --> 00:15:29,280
And so, you know, if you go, you know, hit Wikipedia, and so I didn't do anything fancy,

220
00:15:29,280 --> 00:15:32,400
so you know, you know, look into this too.

221
00:15:32,400 --> 00:15:39,640
So it looks like there was a character, Emanuel Goldberg, who started, interestingly, it looks

222
00:15:39,640 --> 00:15:46,400
like, parsing text from what I believe were early movie film.

223
00:15:46,400 --> 00:15:52,360
So maybe there were titles or credits that needed to be parsed.

224
00:15:52,360 --> 00:16:02,920
So this, what's the right word, innovative character came up with a way to programmatically,

225
00:16:02,920 --> 00:16:08,360
or, you know, through a machine, get characters.

226
00:16:08,360 --> 00:16:11,000
And I guess he was turning them into telegraph code.

227
00:16:11,000 --> 00:16:17,520
So really, really early stages of what we'd call optical character recognition.

228
00:16:17,520 --> 00:16:27,640
And this is where it's like, you know, it's such a small, funny, intertwined world, right.

229
00:16:27,640 --> 00:16:36,760
Because this first device, right, the very first device to use optical character, optical

230
00:16:36,760 --> 00:16:43,880
character recognition, used, as a component, selenium photosensors.

231
00:16:43,880 --> 00:16:51,040
And just, I got a laugh out of this because we're actually using a Python package, Selenium.

232
00:16:51,040 --> 00:17:00,960
So we're also using Selenium in optical character recognition, just a much different type of

233
00:17:00,960 --> 00:17:03,160
Selenium.

234
00:17:03,160 --> 00:17:11,120
So I don't know if you get a laugh out of that, those are the types of jokes that corny

235
00:17:11,120 --> 00:17:12,760
data scientists make.

236
00:17:12,760 --> 00:17:15,400
Don't put that in your write up, Hector.

237
00:17:15,400 --> 00:17:20,600
It'll be a discouraging note.

238
00:17:20,600 --> 00:17:30,040
And then this character, so I recognize this name, so Ray Kurzweil, a technologist, I

239
00:17:30,040 --> 00:17:32,040
guess one would say.

240
00:17:32,040 --> 00:17:41,640
Just interestingly enough, Ray Kurzweil started a company back in 1974 that essentially scanned

241
00:17:41,640 --> 00:17:50,320
documents and I guess I assume digitized them by recognizing the printed font.

242
00:17:50,320 --> 00:17:56,760
So once again, you know, here I'm thinking like, you know, I'm bragging to you that,

243
00:17:56,760 --> 00:17:59,760
you know, oh, you know, we're using this cutting edge technology.

244
00:17:59,760 --> 00:18:07,000
But really, this is technology that Ray Kurzweil has been using for almost 50 years now.

245
00:18:07,000 --> 00:18:18,240
So you know, so we're just so I get inspiration out of this in that it makes me think that,

246
00:18:18,240 --> 00:18:25,480
okay, you know, now we can just kind of use OCR just as sort of like a passive tool, almost

247
00:18:25,480 --> 00:18:27,480
like one would just use like a screwdriver.

248
00:18:27,480 --> 00:18:29,040
We're not quite there yet.

249
00:18:29,040 --> 00:18:31,480
It's still kind of a little complicated to use.

250
00:18:31,480 --> 00:18:38,000
But you know, it'd be nice to think that this is going to be one of the standard tools.

251
00:18:38,000 --> 00:18:44,560
And you know, just to show you that, you know, you can make a buck from parsing documents,

252
00:18:44,560 --> 00:18:49,840
right, look at LexisNexis, parsing legal documents.

253
00:18:49,840 --> 00:18:55,740
And then once we hit the 2000s, there's tons of web-based applications.

254
00:18:55,740 --> 00:19:00,360
But I don't think there's that much open source optical character recognition.

255
00:19:00,360 --> 00:19:10,060
At least, I mean, I know pytesseract, the Tesseract, which we're using, was open sourced in 2018.

256
00:19:10,060 --> 00:19:13,920
So there may have been some open source technology before then.

257
00:19:13,920 --> 00:19:17,160
But so you'll have to look into that yourself.

258
00:19:17,160 --> 00:19:22,720
But that may have been a hurdle, because maybe a lot of I know, like this first technology

259
00:19:22,720 --> 00:19:23,720
was patented.

260
00:19:23,720 --> 00:19:29,280
So it could have been some of this technology was patented for a while.

261
00:19:29,280 --> 00:19:38,520
But that's where my research ends and my questions begin.

262
00:19:38,520 --> 00:19:47,280
But here's sort of the cannabis data science application is okay.

263
00:19:47,280 --> 00:19:50,880
In certain states, it's mandated.

264
00:19:50,880 --> 00:19:54,960
In most states, they're there for the asking.

265
00:19:54,960 --> 00:20:01,560
But if you go and buy a cannabis product, in almost every state that I know

266
00:20:01,560 --> 00:20:05,440
of, quality assurance is mandated.

267
00:20:05,440 --> 00:20:10,720
So the producer has to send a sample to the laboratory to get tested for cannabinoids

268
00:20:10,720 --> 00:20:14,800
and contaminants.

269
00:20:14,800 --> 00:20:19,040
This is interesting data, in my opinion, right?

270
00:20:19,040 --> 00:20:24,040
So for example, I'm working with the cannabis diary.

271
00:20:24,040 --> 00:20:29,160
It's a mobile app that consumers can use to track their consumption.

272
00:20:29,160 --> 00:20:30,160
Right.

273
00:20:30,160 --> 00:20:36,360
And so the idea is, you know, you can trend your cannabinoid consumption over time.

274
00:20:36,360 --> 00:20:38,360
There's a lot more to the app than that.

275
00:20:38,360 --> 00:20:42,800
But this is, you know, a use for the data.

276
00:20:42,800 --> 00:20:48,120
But you'll have to get the data in a manner.

277
00:20:48,120 --> 00:20:52,840
And so the way you would do such is, I mean, there's lots of ways, right?

278
00:20:52,840 --> 00:20:56,600
And this is why I was saying reach out to laboratories.

279
00:20:56,600 --> 00:21:05,880
So for example, I reached out to all the laboratories in Washington state to see if, you know,

280
00:21:05,880 --> 00:21:11,580
any were interested in, you know, trying to make this data easily accessible to consumers

281
00:21:11,580 --> 00:21:22,200
in a similar way to, say, SC Labs in California and MCR Labs in Massachusetts,

282
00:21:22,200 --> 00:21:23,200
respectively.

283
00:21:23,200 --> 00:21:27,640
So I was just seeing, OK, are any of you interested? And one was.

284
00:21:27,640 --> 00:21:31,520
And so there may be further information there.

285
00:21:31,520 --> 00:21:37,320
I'll kind of tease that and maybe talk about that more next week.

286
00:21:37,320 --> 00:21:39,720
But that's just the start.

287
00:21:39,720 --> 00:21:45,800
So for example, this is why I was saying you may want to think about reaching

288
00:21:45,800 --> 00:21:53,520
out to the labs in Maryland, New Jersey and New York and just say, hey, you know, are

289
00:21:53,520 --> 00:21:59,320
you interested in maybe sharing a template of your COA so that way we can make sure that

290
00:21:59,320 --> 00:22:02,080
we can parse it.

291
00:22:02,080 --> 00:22:07,320
But then what I'm interested in is how well does this work in the wild?

292
00:22:07,320 --> 00:22:12,800
And so that's what we may start testing out here in the.

293
00:22:12,800 --> 00:22:13,800
The near future.

294
00:22:13,800 --> 00:22:14,800
Right.

295
00:22:14,800 --> 00:22:17,560
Now, and we'll demo it today,

296
00:22:17,560 --> 00:22:23,640
You know, we have the ability to extract the data from the actual PDF.
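
As a rough illustration of what extracting data from a certificate can look like once the PDF has been turned into plain text, here is a sketch using only the standard library. The line layout (analyte name, number, percent sign), the pattern `RESULT_LINE`, and the sample text are assumptions for illustration; real certificates vary by lab and need lab-specific handling.

```python
import re

# Assumed layout: analyte name, whitespace, a number, then a percent sign.
RESULT_LINE = re.compile(
    r"^\s*(?P<analyte>[A-Za-z0-9\-\s]+?)\s+(?P<value>\d+(?:\.\d+)?)\s*%\s*$"
)


def parse_results(text):
    """Return {analyte: percent} for every line that matches the assumed layout."""
    results = {}
    for line in text.splitlines():
        match = RESULT_LINE.match(line)
        if match:
            results[match.group("analyte").strip()] = float(match.group("value"))
    return results


# Hypothetical text, as it might come out of a PDF text extractor or OCR.
sample_text = """Certificate of Analysis
THC 21.5 %
CBD 0.8%
beta-Myrcene 0.42 %
Batch: not a result line
"""

results = parse_results(sample_text)
print(results)
```

Lines that do not fit the pattern are skipped rather than guessed at, which keeps bad rows out of the data at the cost of missing some results.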

297
00:22:23,640 --> 00:22:28,580
We can do it from an image of the PDF.

298
00:22:28,580 --> 00:22:33,040
Can we do it from a wild image?

299
00:22:33,040 --> 00:22:40,840
So if you know, because if somebody snaps a picture on their phone of a piece of paper,

300
00:22:40,840 --> 00:22:44,120
how good is the resolution going to be with phones these days?

301
00:22:44,120 --> 00:22:47,140
It's shockingly good.

302
00:22:47,140 --> 00:22:49,800
How lined up is the paper going to be?

303
00:22:49,800 --> 00:22:52,280
So you may have to do some fancy stuff there.

304
00:22:52,280 --> 00:22:55,640
But, you know, where there is a will, there's a way.

305
00:22:55,640 --> 00:23:00,240
So that's sort of, you know, on the road map is, you know, because how cool would that

306
00:23:00,240 --> 00:23:05,440
be, right, if you could just take a picture of a certificate and then all of a sudden

307
00:23:05,440 --> 00:23:09,200
all that data is now yours, which it should be.

308
00:23:09,200 --> 00:23:10,200
Right.

309
00:23:10,200 --> 00:23:12,960
Because these are tests mandated for your safety.

310
00:23:12,960 --> 00:23:13,960
Right.

311
00:23:13,960 --> 00:23:17,200
Or just your knowledge.

312
00:23:17,200 --> 00:23:23,080
So then you can, you know, take that data and trend it.

313
00:23:23,080 --> 00:23:30,200
I mean, this is the thing, George and everyone else: you know, once you've

314
00:23:30,200 --> 00:23:33,840
learned, say, time series statistics, you just want to look at everything as a time

315
00:23:33,840 --> 00:23:34,840
series.

316
00:23:34,840 --> 00:23:42,400
So, so anywho, and then in particular.

317
00:23:42,400 --> 00:23:46,600
So that's just a computer science exercise.

318
00:23:46,600 --> 00:23:49,080
And then we actually need some statistics.

319
00:23:49,080 --> 00:23:50,080
Right.

320
00:23:50,080 --> 00:23:56,440
And so the idea is, OK, what do we actually do with this data?

321
00:23:56,440 --> 00:24:02,320
So once we've gotten this data, which would basically be X, so we'll just have all the

322
00:24:02,320 --> 00:24:09,000
data so you can think about X is just a vector of lab result data.

323
00:24:09,000 --> 00:24:16,720
So once you have all of that lab result data, could you actually use that to predict products

324
00:24:16,720 --> 00:24:22,720
that a consumer may enjoy?

325
00:24:22,720 --> 00:24:30,480
And then before I open up to questions, I would just like to say my background is in

326
00:24:30,480 --> 00:24:35,320
economics, and so "enjoy" has a pretty specific meaning to me.

327
00:24:35,320 --> 00:24:39,960
And so, you know, to me, this is basically like, OK, you know, the, you know, the consumer's

328
00:24:39,960 --> 00:24:45,400
trying to maximize their utility from this product.

329
00:24:45,400 --> 00:24:49,440
So actually, "enjoy" is a super loose word.

330
00:24:49,440 --> 00:24:50,440
Right.

331
00:24:50,440 --> 00:24:52,880
So that could mean, right.

332
00:24:52,880 --> 00:24:57,920
Why are the, you know, we're basically I'm just assuming, OK, this is a utility maximizing

333
00:24:57,920 --> 00:24:59,360
consumer.

334
00:24:59,360 --> 00:25:02,960
So enjoy is actually kind of a little subjective, I guess.

335
00:25:02,960 --> 00:25:09,720
So is that their, you know, well-being, you know, are they a medical consumer or are they

336
00:25:09,720 --> 00:25:12,960
a recreational adult use consumer?

337
00:25:12,960 --> 00:25:15,000
So there anywho.

338
00:25:15,000 --> 00:25:21,200
But any thoughts, comments, questions about the questions at hand before I open it up to

339
00:25:21,200 --> 00:25:24,680
show you how we can solve this?

340
00:25:24,680 --> 00:25:33,960
Well, before I bore you to death, let's just go ahead and talk about the algorithm and

341
00:25:33,960 --> 00:25:37,400
then we can go ahead and read in the data.

342
00:25:37,400 --> 00:25:45,160
So long story short is, you know, last week, we were basically just recommending products

343
00:25:45,160 --> 00:25:47,200
off of the average concentration.

344
00:25:47,200 --> 00:25:56,640
So we were just saying, OK, can we find products that are chemically similar to products that

345
00:25:56,640 --> 00:26:00,600
a consumer's purchased in the past?
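
One way to sketch this idea, assuming each product is a fixed-order vector of concentrations and that "chemically similar" means a small Euclidean distance to the average of past purchases. The distance measure, the function names, and the sample numbers are assumptions, not the code from the meetup.

```python
import math


def mean_profile(profiles):
    """Element-wise average of equal-length concentration vectors."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def euclidean(a, b):
    """Straight-line distance between two concentration vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(target, candidates, k=1):
    """Names of the k candidates closest to the target profile."""
    ranked = sorted(candidates, key=lambda name: euclidean(target, candidates[name]))
    return ranked[:k]


# Hypothetical vectors in the order [THC %, CBD %, total terpenes %].
purchased = [[20.0, 1.0, 0.5], [22.0, 0.0, 0.7]]
candidates = {
    "product_a": [21.0, 0.5, 0.6],
    "product_b": [5.0, 12.0, 0.2],
    "product_c": [25.0, 0.1, 1.0],
}

target = mean_profile(purchased)
print(most_similar(target, candidates, k=2))
```

Because THC percentages are much larger than terpene percentages, unscaled Euclidean distance is dominated by THC; standardizing each column first is a common adjustment.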

346
00:26:00,600 --> 00:26:02,760
And there was a typo in the code last week.

347
00:26:02,760 --> 00:26:04,880
I'll show you how we can fix that.

348
00:26:04,880 --> 00:26:08,800
In the meetup, we didn't get to it, but the code's there.

349
00:26:08,800 --> 00:26:15,720
But yes, you know, we can find the most chemically similar products.

350
00:26:15,720 --> 00:26:23,600
But a question was raised: well, should it be taken into consideration how much the

351
00:26:23,600 --> 00:26:27,240
consumer actually likes a product?

352
00:26:27,240 --> 00:26:35,280
And so that's where, does anybody remember sentiment analysis?

353
00:26:35,280 --> 00:26:48,680
So long story short is, we used reviews in the past to basically rank the consumer's

354
00:26:48,680 --> 00:26:53,240
sentiment of a product from zero to one.

355
00:26:53,240 --> 00:27:01,840
So instead of just taking an average of the cannabinoid concentrations, you could basically

356
00:27:01,840 --> 00:27:05,520
take a weighted average.

357
00:27:05,520 --> 00:27:19,800
So if you think about the edge case, so the binary case, zero or one, zero, the consumer

358
00:27:19,800 --> 00:27:25,680
doesn't like the product, one, the consumer likes the product.

359
00:27:25,680 --> 00:27:36,320
Then you would just take the average of the products that the consumer likes.

360
00:27:36,320 --> 00:27:39,040
So that's the easiest case.

361
00:27:39,040 --> 00:27:47,480
And that was sort of what I just kind of conjectured last week was a technique that you could do.

362
00:27:47,480 --> 00:27:50,760
And now we've formally put it into math.

363
00:27:50,760 --> 00:27:54,240
And so this is just a weighted average.

364
00:27:54,240 --> 00:28:00,440
George, you know, this is nothing fancy, but this is one of the reasons why it's good to

365
00:28:00,440 --> 00:28:06,880
get intimate with statistics, because yes, you'll go way above and beyond, but you'll

366
00:28:06,880 --> 00:28:13,800
also become intimately, probably too intimately familiar with weighted averages.

367
00:28:13,800 --> 00:28:20,120
And just, you know, you can go deep with statistics.

368
00:28:20,120 --> 00:28:25,320
So like I said, I would encourage you to pick up some of those skills because I think they're

369
00:28:25,320 --> 00:28:26,320
useful.

370
00:28:26,320 --> 00:28:32,400
But anywho, you know, that's the edge case, zero or one.

371
00:28:32,400 --> 00:28:42,200
And then what you can do is just let theta, so this parameter, vary between zero and one.

372
00:28:42,200 --> 00:28:51,280
And then you'd get your continuous weighted average, where the closer it is to zero, the

373
00:28:51,280 --> 00:28:57,120
less the consumer liked the product, with 0.5 being neutral.

374
00:28:57,120 --> 00:29:02,560
And then everything, you know, greater than 0.5 to one, the consumer generally viewed

375
00:29:02,560 --> 00:29:05,600
as positive.

376
00:29:05,600 --> 00:29:08,000
Okay.

377
00:29:08,000 --> 00:29:15,360
Well, that's the math behind everything.
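
As a rough sketch of that math, assuming each purchased product is a dictionary of analyte concentrations and each sentiment score theta is a number between zero and one, the weighted average x-bar is sum(theta_i * x_i) / sum(theta_i). The function name and the sample numbers below are made up for illustration and are not the code from the talk.

```python
def weighted_average_profile(profiles, sentiments):
    """Sentiment-weighted average concentration for each analyte.

    profiles: list of dicts mapping analyte name -> concentration.
    sentiments: list of weights (theta) between 0 and 1, one per product.
    """
    total_weight = sum(sentiments)
    if total_weight == 0:
        raise ValueError("At least one product needs a positive sentiment score.")
    # Collect every analyte seen in any product; a missing analyte counts as 0.
    analytes = sorted({name for profile in profiles for name in profile})
    return {
        name: sum(w * p.get(name, 0.0) for p, w in zip(profiles, sentiments))
        / total_weight
        for name in analytes
    }


# Hypothetical purchase history for one consumer.
history = [{"thc": 20.0, "cbd": 1.0}, {"thc": 10.0, "cbd": 5.0}]

# Binary edge case: only the liked product (theta = 1) contributes.
liked_only = weighted_average_profile(history, [1, 0])

# Continuous case: theta varies between 0 and 1.
blended = weighted_average_profile(history, [0.75, 0.25])
```

In the binary case this collapses to a plain average over the products the consumer liked, which is the edge case described above.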

378
00:29:15,360 --> 00:29:22,800
And so now let's just, any questions before we dive in?

379
00:29:22,800 --> 00:29:29,440
Is there survey data for the consumer sentiment or are you going to try to infer that?

380
00:29:29,440 --> 00:29:31,680
I'm not quite clear.

381
00:29:31,680 --> 00:29:39,760
We're going to use Leafly review data, so these are fictitious consumers.

382
00:29:39,760 --> 00:29:45,280
So we're just saying, oh, given this, well, there is, I guess, a real person behind the

383
00:29:45,280 --> 00:29:48,240
username, but to us, they're just usernames.

384
00:29:48,240 --> 00:29:56,320
But yes, so we'll basically be making product recommendations for, I suppose, real users

385
00:29:56,320 --> 00:29:57,680
on Leafly.

386
00:29:57,680 --> 00:30:04,240
So I suppose if any Leafly users are watching this and we happen to use your username,

387
00:30:04,240 --> 00:30:09,520
then these may be product recommendations for you.

388
00:30:09,520 --> 00:30:11,560
So this is...

389
00:30:11,560 --> 00:30:20,080
So are you, I'm sorry, are you accessing a Leafly scale where it's like or don't like,

390
00:30:20,080 --> 00:30:24,280
or is it one of the response category scales or something?

391
00:30:24,280 --> 00:30:30,160
There's a number of them in the Leafly scales.

392
00:30:30,160 --> 00:30:31,160
Excellent question, John.

393
00:30:31,160 --> 00:30:40,320
So essentially we're just working with the free form review.

394
00:30:40,320 --> 00:30:42,760
Oh, the text.

395
00:30:42,760 --> 00:30:43,760
Exactly.

396
00:30:43,760 --> 00:30:52,800
So given the text, we're using natural language processing to estimate a score between zero

397
00:30:52,800 --> 00:30:59,720
and one of how positive we think this consumer was with their review.

398
00:30:59,720 --> 00:31:01,600
Okay.

399
00:31:01,600 --> 00:31:06,640
But lots to showcase real quick.

400
00:31:06,640 --> 00:31:12,720
So just going to start reading in a bunch of helpful packages here, and then we'll just

401
00:31:12,720 --> 00:31:20,000
start walking through some of the code because we've got some cool things here today.

402
00:31:20,000 --> 00:31:29,440
So really just reading in a bunch of helpful Python packages.

403
00:31:29,440 --> 00:31:33,360
If you're interested, I'll post the code to GitHub.

404
00:31:33,360 --> 00:31:39,280
So I'll recommend that you may want to read through it if you're interested in the code.

405
00:31:39,280 --> 00:31:44,320
But otherwise I'm just going to start working through here so that way we can get to some

406
00:31:44,320 --> 00:31:50,880
of the interesting work we're doing here.

407
00:31:50,880 --> 00:31:52,680
Okay, cool.

408
00:31:52,680 --> 00:32:03,840
So in this repository, we've got a PDF.

409
00:32:03,840 --> 00:32:12,280
And so this is a certificate of analysis that you may receive from a laboratory or from,

410
00:32:12,280 --> 00:32:21,920
say, if you're a consumer and you purchased Gaviota Mist at a retailer in California.

411
00:32:21,920 --> 00:32:26,360
Perhaps you asked the retailer, hey, can I have their certificate?

412
00:32:26,360 --> 00:32:29,000
And maybe they gave you this PDF.

413
00:32:29,000 --> 00:32:36,840
So as I said, the next step is can we do this if somebody just took free form pictures of

414
00:32:36,840 --> 00:32:37,840
each page?

415
00:32:37,840 --> 00:32:41,320
That's sort of going to be the next hurdle.

416
00:32:41,320 --> 00:32:46,480
But I'll show you something cool that we can do.

417
00:32:46,480 --> 00:32:56,840
So first things first, we'll want to just create just some temporary...

418
00:32:56,840 --> 00:33:08,840
Oops, I just must not have made it where we want it to be.

419
00:33:08,840 --> 00:33:35,600
Hopefully we've got the images folder.

420
00:33:35,600 --> 00:33:36,600
We must have made this...

421
00:33:36,600 --> 00:33:43,000
It's okay, we've made this over in the consumer recs folder by accident.

422
00:33:43,000 --> 00:33:44,000
Oh well.

423
00:33:44,000 --> 00:33:50,040
So anyway, actually, let me just...

424
00:33:50,040 --> 00:33:54,160
If you don't mind, let me just kind of restart this a little quick.

425
00:33:54,160 --> 00:33:56,840
Let's be a bit more straightforward this way.

426
00:33:56,840 --> 00:34:03,040
That way we're kind of saving things where we should be.

427
00:34:03,040 --> 00:34:13,360
All right, so anywho, we're going to need to parse this PDF.

428
00:34:13,360 --> 00:34:18,160
And so we've been doing that in the past, no big issue.

429
00:34:18,160 --> 00:34:27,700
But let's just have a look at this PDF real quick.

430
00:34:27,700 --> 00:34:34,760
So in particular, we're interested in this Mist PDF.

431
00:34:34,760 --> 00:34:35,760
So Mist.

432
00:34:35,760 --> 00:34:43,200
What we can do is we can read this in with pdfplumber.

433
00:34:43,200 --> 00:34:47,240
So as I always say, someone call a plumber.

434
00:34:47,240 --> 00:35:00,920
And so if we read in the document, then we can see the text of the document.

435
00:35:00,920 --> 00:35:06,000
So we can just look at the first page, get all the text.

436
00:35:06,000 --> 00:35:08,120
This is what we've been doing.

437
00:35:08,120 --> 00:35:13,000
Here I'll just print out the first 30 characters.

438
00:35:13,000 --> 00:35:22,560
Uh-oh, if we look at all the text, so this is all the text on the first page.

439
00:35:22,560 --> 00:35:26,920
We just get a bunch of CIDs.

440
00:35:26,920 --> 00:35:29,720
So that's my text.

441
00:35:29,720 --> 00:35:36,520
And so I've kind of looked into this, and this is just the standard that pdfplumber

442
00:35:36,520 --> 00:35:39,880
returns when it doesn't recognize a character.

443
00:35:39,880 --> 00:35:43,840
So we're not recognizing any of the characters on the page.
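
A minimal sketch of that quick test, assuming the extracted text contains tokens of the form `(cid:123)` wherever a glyph could not be mapped to a character. The helper names and the 0.5 threshold are illustrative choices, not the code from the talk.

```python
import re

# Token emitted in extracted text for a glyph that has no character mapping.
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def cid_fraction(text):
    """Fraction of non-whitespace characters that belong to (cid:N) tokens."""
    visible = "".join(text.split())
    if not visible:
        return 0.0
    cid_chars = sum(len(token) for token in CID_TOKEN.findall(visible))
    return cid_chars / len(visible)


def needs_ocr(text, threshold=0.5):
    """Guess whether a page needs OCR: no text at all, or mostly CID tokens."""
    if not text or not text.strip():
        return True
    return cid_fraction(text) >= threshold
```

The text passed in would be whatever came back from extracting the first page, so a page that comes back as nothing but CIDs is flagged for OCR.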

444
00:35:43,840 --> 00:35:50,040
So instead of, so you could do a font lookup at this point, but going down that rabbit

445
00:35:50,040 --> 00:35:52,860
hole was unsuccessful for us.

446
00:35:52,860 --> 00:35:58,080
So what we decided to do was to implement optical character recognition.

447
00:35:58,080 --> 00:36:04,600
So at first it seemed like a daunting task, but actually I'll show you, it works out pretty

448
00:36:04,600 --> 00:36:06,700
well.

449
00:36:06,700 --> 00:36:15,880
So this is just a quick test just to see, oh, yes, we need optical character recognition.

450
00:36:15,880 --> 00:36:17,420
Cool.

451
00:36:17,420 --> 00:36:21,160
So how does one go about doing such?

452
00:36:21,160 --> 00:36:29,040
Well, interestingly enough, you actually first, you just start by converting all of these

453
00:36:29,040 --> 00:36:47,240
pages, this may throw an error actually, but we should start generating these images.

454
00:36:47,240 --> 00:37:01,960
So basically the first step, and I've got the algorithm up here, is one, I think, may

455
00:37:01,960 --> 00:37:16,840
not have created these directories.

456
00:37:16,840 --> 00:37:21,000
Sorry for the, hold up here.

457
00:37:21,000 --> 00:37:24,200
First thing first, turn them into images.

458
00:37:24,200 --> 00:37:37,480
I've got a feeling I just forgot the code up top to create that directory.

459
00:37:37,480 --> 00:37:43,640
And one thing is I'm currently cranking the resolution up pretty high.

460
00:37:43,640 --> 00:37:54,680
And so what I found is if the resolution's any less than 300, here we are, then it's

461
00:37:54,680 --> 00:37:55,680
no good.

462
00:37:55,680 --> 00:37:59,600
So here I'm just going to copy one of these images because they're temporary.

463
00:37:59,600 --> 00:38:10,800
So if you look at one of these, this is basically just an image of the first page of the PDF.

464
00:38:10,800 --> 00:38:19,360
So this is where I realized that the consumer could potentially take these pictures if we

465
00:38:19,360 --> 00:38:21,880
can get the orientation correct, right?

466
00:38:21,880 --> 00:38:26,120
Because we're basically now working with an image, right?

467
00:38:26,120 --> 00:38:33,560
So one could think if you lined your phone up perfectly, you could take such an image.

468
00:38:33,560 --> 00:38:40,520
And I'm curious if anyone else, if you want to look at this, can we work with a wild image

469
00:38:40,520 --> 00:38:45,440
of a COA?

470
00:38:45,440 --> 00:38:50,880
So that's the first step is basically create all the images.

471
00:38:50,880 --> 00:38:52,680
And then this is the fun part.

472
00:38:52,680 --> 00:38:56,840
And so we're standing on the shoulders of giants.

473
00:38:56,840 --> 00:39:03,560
So we're just using Tesseract to turn this image into a PDF.

474
00:39:03,560 --> 00:39:14,200
So Tesseract is an open source OCR library that you can install.

475
00:39:14,200 --> 00:39:18,240
Please get in touch with me if you need any help getting it installed.

476
00:39:18,240 --> 00:39:23,560
And then once it's installed, then you're essentially off to the races.

477
00:39:23,560 --> 00:39:27,800
So we brought all these images here.

478
00:39:27,800 --> 00:39:31,840
So the six images of the six pages.

479
00:39:31,840 --> 00:39:35,520
And what file format are those images at this point?

480
00:39:35,520 --> 00:39:37,800
These are in PNG.

481
00:39:37,800 --> 00:39:39,480
.png, okay.

482
00:39:39,480 --> 00:39:40,760
So it was recommended.

483
00:39:40,760 --> 00:39:45,400
So we did a couple things here if anyone's interested.

484
00:39:45,400 --> 00:39:51,400
So this was a function I found on Stack Overflow.

485
00:39:51,400 --> 00:39:57,720
But technically, we're basically removing the alpha channel.

486
00:39:57,720 --> 00:40:05,640
And a computer scientist may need to help me with this.

487
00:40:05,640 --> 00:40:08,360
Perhaps this has to do something with transparency or?

488
00:40:08,360 --> 00:40:12,280
Yeah, the alpha channel is transparency.

489
00:40:12,280 --> 00:40:14,040
That's right.

490
00:40:14,040 --> 00:40:21,160
So as I said, I'm not 100% sure if this is necessary.

491
00:40:21,160 --> 00:40:25,400
But this was sort of the recommended way to turn a PDF to a PNG.

492
00:40:25,400 --> 00:40:30,280
So in fact, George or anyone else, that's sort of the benefit of having this code open

493
00:40:30,280 --> 00:40:37,480
sourced is feel free to look through this function and see if this is appropriate.

494
00:40:37,480 --> 00:40:40,440
Is a PNG in fact appropriate?

495
00:40:40,440 --> 00:40:44,120
Or would a different file format be appropriate?

496
00:40:44,120 --> 00:40:52,960
But I think what this is doing is if the PDF has a transparent background, I think we're

497
00:40:52,960 --> 00:40:59,760
just giving it a white background so we can recognize the text better.

498
00:40:59,760 --> 00:41:07,440
So once again, it may be overkill, but that's part of the algorithm.

499
00:41:07,440 --> 00:41:09,480
So important.

500
00:41:09,480 --> 00:41:12,680
It could be important.

501
00:41:12,680 --> 00:41:17,720
And then essentially, just one by one, this is where we're actually using the optical

502
00:41:17,720 --> 00:41:19,800
character recognition.

503
00:41:19,800 --> 00:41:30,320
And like I said, this is what I was talking about, like OCR as almost like a screwdriver.

504
00:41:30,320 --> 00:41:35,520
So we're not just an OCR company.

505
00:41:35,520 --> 00:41:38,760
It's just a tool that we use.

506
00:41:38,760 --> 00:41:44,840
So it's like when you're building a house, you may use a screwdriver.

507
00:41:44,840 --> 00:41:52,120
So when you're building an app, you may use OCR.

508
00:41:52,120 --> 00:41:58,360
So now you can actually recognize this text here.

509
00:41:58,360 --> 00:42:05,320
So I'll prove that we can do that here in Python.

510
00:42:05,320 --> 00:42:08,680
So now we've got Mist zero.

511
00:42:08,680 --> 00:42:11,480
So now let's see if we can't.

512
00:42:11,480 --> 00:42:13,680
So this is what I call a sanity check.

513
00:42:13,680 --> 00:42:21,320
So as you're programming, it's worth just saying, oh, can we just read in Mist zero

514
00:42:21,320 --> 00:42:24,200
before we go too much further?

515
00:42:24,200 --> 00:42:27,920
Can we actually get the text on this page?

516
00:42:27,920 --> 00:42:34,920
And drum roll.

517
00:42:34,920 --> 00:42:37,800
We don't have a bunch of CIDs.

518
00:42:37,800 --> 00:42:44,400
And so now the idea is now you can just go about parsing as usual.

519
00:42:44,400 --> 00:42:48,280
And so it's not necessarily an easy feat.

520
00:42:48,280 --> 00:42:54,400
So pdfplumber has methods for extracting tables.

521
00:42:54,400 --> 00:43:00,880
If any of you think of clever algorithms for how to parse the actual data, that's kind

522
00:43:00,880 --> 00:43:09,160
of where I think the main value is added, because we're basically using tools to recognize

523
00:43:09,160 --> 00:43:12,680
all the text.

524
00:43:12,680 --> 00:43:21,200
And then basically, the value added part is writing a custom algorithm to essentially

525
00:43:21,200 --> 00:43:28,320
be able to handle all the nuances of the formatting choices that people chose.

526
00:43:28,320 --> 00:43:37,480
And like I said, the grand scheme of this is to create a generalized parsing algorithm.

527
00:43:37,480 --> 00:43:45,960
So that way, the algorithm is simply smart enough to recognize that this 86.04% corresponds

528
00:43:45,960 --> 00:43:49,280
to total cannabinoids.

529
00:43:49,280 --> 00:43:50,680
Easier said than done.

530
00:43:50,680 --> 00:43:57,680
But if you're clever, capitalize on this opportunity.

531
00:43:57,680 --> 00:44:02,280
But this is sort of what we've been working on.

532
00:44:02,280 --> 00:44:10,800
So if you're interested, we basically have a specific algorithm to parse the COA.

533
00:44:10,800 --> 00:44:12,040
It's not glamorous.

534
00:44:12,040 --> 00:44:21,280
It just uses the way the text is laid out.

535
00:44:21,280 --> 00:44:27,000
But this is the important part, is just proving that you can in fact read the text.

536
00:44:27,000 --> 00:44:33,000
So as I said before, it was just all CID.

537
00:44:33,000 --> 00:44:35,400
That's useless to us.

538
00:44:35,400 --> 00:44:39,600
And now we actually have useful data.

539
00:44:39,600 --> 00:44:42,000
And as I said, it came from an image.

540
00:44:42,000 --> 00:44:47,040
So that's just kind of cool.

541
00:44:47,040 --> 00:44:54,320
And then just to kind of show you this in the wild, technically, what you can do now

542
00:44:54,320 --> 00:45:02,440
is right now you have six individual PDFs.

543
00:45:02,440 --> 00:45:07,720
Well, just merge them back together.

544
00:45:07,720 --> 00:45:15,440
And then here I'm just removing the unused PDFs just because I just, right, George, this

545
00:45:15,440 --> 00:45:17,160
is something that you'll pick up.

546
00:45:17,160 --> 00:45:18,160
Right.

547
00:45:18,160 --> 00:45:22,120
And so this is why it's useful to have the computer science background is you just kind

548
00:45:22,120 --> 00:45:28,720
of have to have that foresight that, yeah, we can't have thousands of thousands of high

549
00:45:28,720 --> 00:45:33,640
resolution PDFs just redundantly laying around.

550
00:45:33,640 --> 00:45:36,960
That's just going to be a lot of storage.

551
00:45:36,960 --> 00:45:41,720
So I'm just doing a little bit of data cleanup.
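
Roughly, the whole pipeline could be sketched like this. The library choices here (pdf2image, Pillow, pytesseract, pypdf) are assumptions for illustration and may not match the packages used in the talk, and the function name is hypothetical. This version keeps the pages in memory rather than writing temporary images and PDFs, so there is nothing to clean up afterwards.

```python
import io
from pathlib import Path

MIN_DPI = 300  # below this resolution the OCR output was reported to be no good


def ocr_pdf(pdf_path, out_path, dpi=MIN_DPI):
    """Rasterize each page, OCR it, and merge the pages into one searchable PDF."""
    if dpi < MIN_DPI:
        raise ValueError(f"dpi={dpi} is below the {MIN_DPI} needed for usable OCR")

    # Third-party imports are deferred so the guard above runs without them.
    from pdf2image import convert_from_path  # requires Poppler to be installed
    from PIL import Image
    from pypdf import PdfWriter
    import pytesseract  # requires the Tesseract binary to be installed

    writer = PdfWriter()
    for page_image in convert_from_path(str(pdf_path), dpi=dpi):
        # Remove the alpha channel by flattening onto a white background.
        rgba = page_image.convert("RGBA")
        flat = Image.new("RGB", rgba.size, (255, 255, 255))
        flat.paste(rgba, mask=rgba.getchannel("A"))
        # Tesseract returns a one-page PDF with a text layer, as bytes.
        page_pdf = pytesseract.image_to_pdf_or_hocr(flat, extension="pdf")
        writer.append(io.BytesIO(page_pdf))

    with open(out_path, "wb") as merged:
        writer.write(merged)
    return Path(out_path)
```

The merged file can then be opened and parsed like any other PDF that has a text layer.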

552
00:45:41,720 --> 00:45:50,280
But what did we, we did save this, right?

553
00:45:50,280 --> 00:45:52,680
We just saved it as the test PDF.

554
00:45:52,680 --> 00:45:57,440
And so here is just the PDF.

555
00:45:57,440 --> 00:46:04,480
And so now, you know, we should be able to say, OK, let's say this doc is now the test

556
00:46:04,480 --> 00:46:07,720
PDF.

557
00:46:07,720 --> 00:46:18,600
You now can just say, OK, you know, for, well, first you have to read it in.

558
00:46:18,600 --> 00:46:23,440
So you open the doc and then you can just say, OK, you know, for page, you know, in

559
00:46:23,440 --> 00:46:35,040
report.pages, just extract the text from each page and handle that accordingly.

560
00:46:35,040 --> 00:46:39,640
And so you just get yourself a bunch of text.

561
00:46:39,640 --> 00:46:41,680
So wall of text.

562
00:46:41,680 --> 00:46:44,080
So you've heard of the wall of sound.

563
00:46:44,080 --> 00:46:47,080
Well, now we give you the wall of text.

564
00:46:47,080 --> 00:46:54,480
So anywho, now you have, you know, all the data from the COA.

565
00:46:54,480 --> 00:46:59,440
And you know, it could be worse.

566
00:46:59,440 --> 00:47:05,840
You know, to be honest, we did some patent scraping and legal documents are honestly

567
00:47:05,840 --> 00:47:07,480
much worse right there.

568
00:47:07,480 --> 00:47:09,920
The formatting is great.

569
00:47:09,920 --> 00:47:13,320
Like every single legal document may be different here.

570
00:47:13,320 --> 00:47:20,480
You know, at least every certificate from Cannalysis labs is the same.

571
00:47:20,480 --> 00:47:26,960
So just to go ahead, just to quit teasing you, there's all the data.

572
00:47:26,960 --> 00:47:28,960
Well, we want that data.

573
00:47:28,960 --> 00:47:35,400
And so, like I said, if you're interested in the algorithm, check it out.

574
00:47:35,400 --> 00:47:44,360
But we should just be able to use one of these tools we built, CoADoc, and we just say,

575
00:47:44,360 --> 00:47:47,160
hey, you know, let's check this out.

576
00:47:47,160 --> 00:47:49,560
Starting with.

577
00:47:49,560 --> 00:47:54,720
Let's start with the original Mist that we couldn't identify.

578
00:47:54,720 --> 00:47:58,800
So this is the doc that we couldn't identify.

579
00:47:58,800 --> 00:48:02,760
Confirm.

580
00:48:02,760 --> 00:48:05,920
Now, drum roll.

581
00:48:05,920 --> 00:48:09,720
So remember, this is going to take a hot second.

582
00:48:09,720 --> 00:48:16,080
And I think we can hopefully watch it happen as it happens.

583
00:48:16,080 --> 00:48:27,520
But, you know, essentially just packaged this same functionality, the convert to images,

584
00:48:27,520 --> 00:48:28,520
convert to PDF.

585
00:48:28,520 --> 00:48:33,920
See, there are the images being generated.

586
00:48:33,920 --> 00:48:35,680
Then we don't need the images anymore.

587
00:48:35,680 --> 00:48:40,600
So I'll remove those once the PDFs are created.

588
00:48:40,600 --> 00:48:43,920
So here you see PDF 01.

589
00:48:43,920 --> 00:48:50,560
And, you know, so we'll convert all the pages.

590
00:48:50,560 --> 00:48:58,800
And then at that point, you know, CoADoc has a bunch of built-in algorithms.

591
00:48:58,800 --> 00:49:01,600
I wonder.

592
00:49:01,600 --> 00:49:08,920
Sorry, I built kind of over.

593
00:49:08,920 --> 00:49:10,920
Kind of over blowing the memory here.

594
00:49:10,920 --> 00:49:11,920
But well.

595
00:49:11,920 --> 00:49:16,880
Like I said, this is the important part.

596
00:49:16,880 --> 00:49:29,480
So this is the value that I think analytics added: now, given this PDF, which,

597
00:49:29,480 --> 00:49:35,120
you know, originally was just as good as an image.

598
00:49:35,120 --> 00:49:42,160
We can now get all the data from it, in particular, all these results.

599
00:49:42,160 --> 00:49:51,920
So for example, you've got your Delta 9 THC, 83.27.

600
00:49:51,920 --> 00:49:56,000
So that should be in this table.

601
00:49:56,000 --> 00:50:01,320
So 83.27 for Delta 9 THC.

602
00:50:01,320 --> 00:50:05,000
So now you don't have to do any manual data entry.

603
00:50:05,000 --> 00:50:14,880
So now, you know, say you want to trend the milligrams of THC that you're consuming, which

604
00:50:14,880 --> 00:50:19,040
is something that I'm interested in doing, because I heard a saying, you know, if you're

605
00:50:19,040 --> 00:50:21,680
not measuring it, you're not managing it.

606
00:50:21,680 --> 00:50:25,080
So I'm curious about this.

607
00:50:25,080 --> 00:50:29,920
And then, as I said, you know, basically, the thing the cannabis diary app is doing

608
00:50:29,920 --> 00:50:37,040
is recording consumption in tandem with, say, your activities of the day.

609
00:50:37,040 --> 00:50:39,840
And, you know, maybe your reviews as well.

610
00:50:39,840 --> 00:50:45,960
So that way you can see, you know, are these compounds, you know, having a positive or

611
00:50:45,960 --> 00:50:47,600
a negative effect on you.

612
00:50:47,600 --> 00:50:48,600
Right.

613
00:50:48,600 --> 00:50:51,360
So was it a bad day?

614
00:50:51,360 --> 00:50:54,560
How many milligrams of THC did you consume?

615
00:50:54,560 --> 00:50:55,680
Or was it a good day?

616
00:50:55,680 --> 00:50:57,320
How many milligrams did you consume?

617
00:50:57,320 --> 00:51:04,840
And like I said, you know, you may be over consuming, you may be under consuming, say,

618
00:51:04,840 --> 00:51:07,360
if this is something you need.

619
00:51:07,360 --> 00:51:09,640
This is 100% up to you.

620
00:51:09,640 --> 00:51:15,480
But like I said, this is just something that I personally was interested in.

621
00:51:15,480 --> 00:51:20,040
And that's George, you may have already figured this out.

622
00:51:20,040 --> 00:51:25,840
But that's kind of how you start with the, that's how you can start with a good app.

623
00:51:25,840 --> 00:51:32,600
Because if you just find a problem in your own life and say, hey, you know, how can I

624
00:51:32,600 --> 00:51:33,600
fix this?

625
00:51:33,600 --> 00:51:38,280
Or, you know, you can find a problem in someone else's life or a problem with someone else's

626
00:51:38,280 --> 00:51:40,760
business and solve that, right?

627
00:51:40,760 --> 00:51:42,720
That's what business models are all about.

628
00:51:42,720 --> 00:51:47,280
But this is a method for like a consumer app.

629
00:51:47,280 --> 00:51:49,840
So what's the problem in my life?

630
00:51:49,840 --> 00:51:51,520
How can I solve it?

631
00:51:51,520 --> 00:51:54,680
Can I build an app to solve that problem?

632
00:51:54,680 --> 00:51:57,720
Would other people be interested in using that app?

633
00:51:57,720 --> 00:52:05,240
So long story short is, you know, now we can not only just get the THC and CBD, which

634
00:52:05,240 --> 00:52:09,040
maybe, you know, you could enter that in by hand.

635
00:52:09,040 --> 00:52:18,040
But now we get alpha-cedrene, alpha-pinene, alpha-terpineol.

636
00:52:18,040 --> 00:52:27,720
You know, it's kind of miraculous, the amount of compounds that are in a cannabis plant.

637
00:52:27,720 --> 00:52:37,240
And good folks like John at the CESC are currently researching what in the world makes some of

638
00:52:37,240 --> 00:52:43,520
these compounds, like what effect may they have on you?

639
00:52:43,520 --> 00:52:45,480
And different people, right?

640
00:52:45,480 --> 00:52:48,240
Everyone's got different biochemistries.

641
00:52:48,240 --> 00:52:54,320
And, you know, this is where I mean, my knowledge is already separate.

642
00:52:54,320 --> 00:53:01,400
So it's like, you know, first things first, I'd be interested in trending THC and CBD.

643
00:53:01,400 --> 00:53:06,640
But then what's going on with all of these analytes?

644
00:53:06,640 --> 00:53:12,800
You know, maybe they have meaningful, meaningful effects.

645
00:53:12,800 --> 00:53:19,360
And so we're essentially using these, you know, this is what I talked about, our X bar.

646
00:53:19,360 --> 00:53:27,800
So all of these, this is basically our, we're basically going to create a vector of all

647
00:53:27,800 --> 00:53:34,000
of these results.

648
00:53:34,000 --> 00:53:38,080
So to kind of, so that's the OCR.

649
00:53:38,080 --> 00:53:42,600
I mean, you know, I promised you a data science exercise.

650
00:53:42,600 --> 00:53:50,040
So if you can hang out for 10 more minutes, then I'll go ahead and show you here how you

651
00:53:50,040 --> 00:53:56,400
could actually use this data.

652
00:53:56,400 --> 00:54:09,720
So I think if I just change this data directory, then everything should be okay.

653
00:54:09,720 --> 00:54:11,480
All right.

654
00:54:11,480 --> 00:54:21,440
So now we're just going to use a big data set of these lab results.

655
00:54:21,440 --> 00:54:24,280
And so the idea is, okay, awesome.

656
00:54:24,280 --> 00:54:31,840
You know, we now have a mechanism to parse all of this data.

657
00:54:31,840 --> 00:54:37,960
You know, and like I said, you can look at contaminants too.

658
00:54:37,960 --> 00:54:40,480
Because here, look at this.

659
00:54:40,480 --> 00:54:45,600
So this is actually an important thing that I kind of want to take a tangent on, just

660
00:54:45,600 --> 00:54:48,520
to kind of help, right, it's Cannabis Data Science after all.

661
00:54:48,520 --> 00:54:54,320
So just to kind of help people understand what's going on in the cannabis industry.

662
00:54:54,320 --> 00:55:01,980
So this here is a concentrate.

663
00:55:01,980 --> 00:55:05,060
So this is concentrated cannabis.

664
00:55:05,060 --> 00:55:17,120
So what people do is they use some sort of solvent to basically concentrate down the

665
00:55:17,120 --> 00:55:21,920
cannabinoids, and then typically these solvents aren't something that you want to be consuming.

666
00:55:21,920 --> 00:55:27,000
They're typically purged, well, not typically, they are, you know, they try to purge as much

667
00:55:27,000 --> 00:55:32,000
of this as possible, ideally 100% from the product.

668
00:55:32,000 --> 00:55:37,840
But, you know, as we've kind of found out with chemistry, you know, dealing with 100%

669
00:55:37,840 --> 00:55:38,840
is difficult.

670
00:55:38,840 --> 00:55:47,760
And so the idea here is it looks like this is an ethanol concentrate, because basically

671
00:55:47,760 --> 00:55:56,120
it looks like this laboratory has a detection limit on their residual solvents.

672
00:55:56,120 --> 00:56:01,920
I'm not sure if it says anywhere right off the bat, but here it says.

673
00:56:01,920 --> 00:56:12,120
So they can detect down to 10, please correct me on this if I'm wrong, but I think this

674
00:56:12,120 --> 00:56:16,320
is, this may just be micrograms.

675
00:56:16,320 --> 00:56:19,480
Yeah, that's micrograms.

676
00:56:19,480 --> 00:56:26,000
Okay, so they can detect down to 10 micrograms.

677
00:56:26,000 --> 00:56:32,440
So they can only quantify down to 50 micrograms.

678
00:56:32,440 --> 00:56:46,080
So technically this product here has between 10 and 50 micrograms of ethanol in it.

679
00:56:46,080 --> 00:56:53,080
Per gram, you always have to normalize it to a gram.

680
00:56:53,080 --> 00:56:56,320
You may even have a, oh, it's a half gram concentrate.

681
00:56:56,320 --> 00:57:00,960
No, but the unit is micrograms per gram.

682
00:57:00,960 --> 00:57:08,640
I know, I was just wondering if I accidentally spoke correctly, even though I misspoke.

683
00:57:08,640 --> 00:57:10,260
John's correct.

684
00:57:10,260 --> 00:57:14,880
This is, say, 40 micrograms per gram.

685
00:57:14,880 --> 00:57:23,120
And if it's a half a gram concentrate, then you would expect around, right,

686
00:57:23,120 --> 00:57:31,400
your expectation would be 20 micrograms of ethanol.

687
00:57:31,400 --> 00:57:38,040
And like I said, that's up to the consumer to decide that may be a negligible amount

688
00:57:38,040 --> 00:57:40,080
of ethanol.

689
00:57:40,080 --> 00:57:48,240
Right, so for example, somebody once told me that, I mean, if you're using a butane

690
00:57:48,240 --> 00:57:54,640
lighter, you're maybe consuming a non-negligible amount of butane.

691
00:57:54,640 --> 00:57:57,280
That's a possibility.

692
00:57:57,280 --> 00:58:03,800
That's actually some research that I would love to see somebody do with one of those

693
00:58:03,800 --> 00:58:04,800
inhalation machines.

694
00:58:04,800 --> 00:58:06,200
But those are at laboratories.

695
00:58:06,200 --> 00:58:10,760
But here, let me stay on track here.

696
00:58:10,760 --> 00:58:14,720
Anywho, this is kind of, this was one of the reasons I was thinking that it was important

697
00:58:14,720 --> 00:58:24,400
to parse these is this one less than the LOQ, you know, this one, this one may be negligible.

698
00:58:24,400 --> 00:58:30,960
But remember, you know, the action limit is 5,000.

699
00:58:30,960 --> 00:58:44,440
So technically, you could have a concentrate pass at 4,999 micrograms per gram.
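
To make those three thresholds concrete, here is a small sketch using the numbers from this certificate: a limit of detection of 10, a limit of quantification of 50, and an action limit of 5,000 micrograms per gram. The labels "ND" and "<LOQ", the function name, and the rule that a result passes as long as it does not exceed the action limit are assumptions for illustration.

```python
def interpret_result(reported, lod=10.0, loq=50.0, action_limit=5000.0):
    """Turn a reported residual solvent result into bounds (ug/g) and pass/fail."""
    text = str(reported).strip().upper().replace(",", "")
    if text == "ND":
        # Not detected: the true value lies somewhere below the detection limit.
        return {"detected": False, "low": 0.0, "high": lod, "passes": True}
    if text == "<LOQ":
        # Detected but too little to quantify: between the LOD and the LOQ.
        return {"detected": True, "low": lod, "high": loq,
                "passes": loq <= action_limit}
    value = float(text)
    # Assumption: a quantified result passes if it does not exceed the limit.
    return {"detected": True, "low": value, "high": value,
            "passes": value <= action_limit}


ethanol = interpret_result("<LOQ")     # the ethanol row discussed above
borderline = interpret_result("4,999")  # just under the action limit
over = interpret_result("5,001")
```

So the ethanol result here counts as a detect bounded between 10 and 50 micrograms per gram, and it is far below the action limit.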

700
00:58:44,440 --> 00:58:47,640
A laboratory probably wouldn't pass that.

701
00:58:47,640 --> 00:58:49,640
I mean, they may.

702
00:58:49,640 --> 00:58:57,360
But right, I don't know, it'd be interesting to ask laboratories what their policy on

703
00:58:57,360 --> 00:58:59,520
that would be.

704
00:58:59,520 --> 00:59:06,720
But this is something that we looked at in Washington state is you do see high levels

705
00:59:06,720 --> 00:59:14,720
that do get passed, right, like 4,200 or 4,800 would probably pass.

706
00:59:14,720 --> 00:59:19,080
And as I said, that's up to the consumer to decide.

707
00:59:19,080 --> 00:59:21,760
100% up to the consumer.

708
00:59:21,760 --> 00:59:27,320
They may say, hey, consuming tons of butane, I don't really, I don't care about this small

709
00:59:27,320 --> 00:59:28,320
amount.

710
00:59:28,320 --> 00:59:31,240
Maybe they do care about the small amount.

711
00:59:31,240 --> 00:59:40,960
You know, I, for one, primarily just care about detections of pesticides.

712
00:59:40,960 --> 00:59:41,960
So that's what I look at.

713
00:59:41,960 --> 00:59:48,640
I just, I love to see the full clean panel.

714
00:59:48,640 --> 00:59:54,880
Whether there's any logical sense to that, I'm not 100% certain yet.

715
00:59:54,880 --> 01:00:00,600
It's just one of those things where, you know, even if I see a hit, I'm just a little skeptical

716
01:00:00,600 --> 01:00:04,800
of, you know, are there hot spots in the flower?

717
01:00:04,800 --> 01:00:10,800
And maybe we just didn't test one of the hot spots or I don't know.

718
01:00:10,800 --> 01:00:13,080
But I guess every consumer has their own preferences.

719
01:00:13,080 --> 01:00:15,640
Those are just my own preferences.

720
01:00:15,640 --> 01:00:24,760
But the main point is, if you actually bought this concentrate at the store, they may have

721
01:00:24,760 --> 01:00:28,400
the total cannabinoids and the total THC label.

722
01:00:28,400 --> 01:00:38,120
But, you know, the chances of them having, you know, all 20 to 40 terpenes plus, you know,

723
01:00:38,120 --> 01:00:46,480
they're, you know, the, the retailer's not going to put this, say, technically, this

724
01:00:46,480 --> 01:00:47,840
is a detect, right?

725
01:00:47,840 --> 01:00:52,840
So they wouldn't label this product as having detected ethanol in it.

726
01:00:52,840 --> 01:00:58,440
And, you know, if it's an ethanol concentrate, you know, like I said, it's chemistry, you

727
01:00:58,440 --> 01:01:01,400
know, you may not ever really be able to purge 100%.

728
01:01:01,400 --> 01:01:09,720
But anywho, I just kind of wanted to go down that rabbit hole, just because these are things

729
01:01:09,720 --> 01:01:15,740
that I kind of take for granted because I spent a little time at the laboratory.

730
01:01:15,740 --> 01:01:18,280
So I'm just kind of familiar with these abbreviations.

731
01:01:18,280 --> 01:01:26,720
But if these abbreviations are new to you, then maybe you've gained some insight there.

732
01:01:26,720 --> 01:01:29,240
But enough of that nonsense.

733
01:01:29,240 --> 01:01:32,280
Let's actually look at the data, right?

734
01:01:32,280 --> 01:01:37,080
That's what we, that's what we came here to do.

735
01:01:37,080 --> 01:01:46,600
Here, let's just work with the data file we were working with last week because it's behaving

736
01:01:46,600 --> 01:01:47,600
better.

737
01:01:47,600 --> 01:02:01,440
So the idea is these are just a bunch of values that we've already parsed.

738
01:02:01,440 --> 01:02:11,880
So if you look at, say, Raw Garden's website, you can see that they have, they actually

739
01:02:11,880 --> 01:02:16,640
have hundreds of these Cannalysis COAs.

740
01:02:16,640 --> 01:02:21,520
So that's why we originally wanted to parse these, is now we can parse hundreds of their

741
01:02:21,520 --> 01:02:22,520
COAs.

742
01:02:22,520 --> 01:02:30,760
And I think today we're just working with a sample of 100,000.

743
01:02:30,760 --> 01:02:37,560
So you know, if you plot the, so we're just going to start doing some plots.

744
01:02:37,560 --> 01:02:39,840
And I'll try to get through everything quickly.

745
01:02:39,840 --> 01:02:43,360
So thank you all for staying so long.

746
01:02:43,360 --> 01:02:47,840
So long story short is, you've got the data, let's look at it.

747
01:02:47,840 --> 01:02:51,520
So we're just going to care about two terpenes today.

748
01:02:51,520 --> 01:02:57,200
And as I said, you know, you could use 40.

749
01:02:57,200 --> 01:03:05,520
We're just going to use two today just as a proof of concept.

750
01:03:05,520 --> 01:03:12,600
And then we're just going to use a bunch of consumer reviews.

751
01:03:12,600 --> 01:03:24,480
I just realized the weighted average part may have an error.

752
01:03:24,480 --> 01:03:30,960
So I can at least walk you through the sentiment analysis.

753
01:03:30,960 --> 01:03:37,840
And then if we run into a, because I think the problem on this weighted average here

754
01:03:37,840 --> 01:03:48,720
is I would say we're working with a vector and not a one dimensional scalar.

755
01:03:48,720 --> 01:03:58,440
If that's the correct terminology, it's been a hot minute.

756
01:03:58,440 --> 01:04:08,720
In fact, I can maybe even try to change this to a scalar really, really

757
01:04:08,720 --> 01:04:11,320
quick while the data loads.

758
01:04:11,320 --> 01:04:21,440
So long story short is I'm going to be using the log of the ratio of beta-pinene to d-limonene as

759
01:04:21,440 --> 01:04:33,800
my predicting factor because we've talked about it in the past.
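
Editor's note: the log terpene ratio mentioned here can be sketched as follows. This is a minimal illustration, not the speaker's code; the function name, the field names, and the sample values are all hypothetical, and treating non-detects as missing is an assumption.

```python
import math


def log_terpene_ratio(beta_pinene, limonene):
    """Return log(beta_pinene / limonene), or None when either value is
    missing or non-positive (for example, a non-detect)."""
    if beta_pinene is None or limonene is None:
        return None
    if beta_pinene <= 0 or limonene <= 0:
        return None
    return math.log(beta_pinene / limonene)


# Hypothetical lab results, in percent by weight.
samples = [
    {"product": "A", "beta_pinene": 0.11, "limonene": 0.44},
    {"product": "B", "beta_pinene": 0.20, "limonene": 0.20},
    {"product": "C", "beta_pinene": 0.00, "limonene": 0.35},  # non-detect
]

# Attach the predictor to each sample.
for sample in samples:
    sample["log_ratio"] = log_terpene_ratio(
        sample["beta_pinene"], sample["limonene"]
    )
```

Taking the log makes the ratio symmetric around zero: equal concentrations give 0, and swapping the two terpenes only flips the sign.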

760
01:04:33,800 --> 01:04:37,560
So as I said, you can build your own prediction model.

761
01:04:37,560 --> 01:04:44,600
And that's why I was saying the fun part once you get the statistics down is actually just

762
01:04:44,600 --> 01:04:50,440
thinking of good variables to use and wrangling them.

763
01:04:50,440 --> 01:05:01,480
So the reviews, we have maybe 42,000 reviews.

764
01:05:01,480 --> 01:05:07,840
So that takes a hot minute. I always kind of push my computer to its limit with the

765
01:05:07,840 --> 01:05:11,080
streaming and all.

766
01:05:11,080 --> 01:05:29,120
So just checking to make sure everybody's still here.

767
01:05:29,120 --> 01:05:34,280
See if we can't free up some bandwidth.

768
01:05:34,280 --> 01:05:44,640
So while the reviews are loading, I'll just go ahead and start wrapping up for today.

769
01:05:44,640 --> 01:05:48,720
And then we'll maybe look at one last visualization and then get out of here.

770
01:05:48,720 --> 01:05:51,880
I'll let you get out of here and get on with your day.

771
01:05:51,880 --> 01:05:58,000
But basically last week we said, OK, start somewhere and then iterate.

772
01:05:58,000 --> 01:06:03,680
And then I just wanted to build on that today and just say the place where you begin doesn't

773
01:06:03,680 --> 01:06:04,960
have to be perfect.

774
01:06:04,960 --> 01:06:14,640
And so this is, we're basically actually just using the lessons of gradient descent and

775
01:06:14,640 --> 01:06:18,840
actually putting those into application sort of in our real life.

776
01:06:18,840 --> 01:06:27,840
So this is sort of going to be a real life gradient descent function where it's, you

777
01:06:27,840 --> 01:06:34,560
just start somewhere and iterate and just, if you go somewhere and it's not the right

778
01:06:34,560 --> 01:06:37,080
direction, go backwards.

779
01:06:37,080 --> 01:06:41,560
And then if you go somewhere and it is in the right direction, just keep going more

780
01:06:41,560 --> 01:06:44,040
in that direction.

781
01:06:44,040 --> 01:06:46,840
That's gradient descent in a nutshell.
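
Editor's note: the "start somewhere and iterate" idea can be sketched as a bare-bones gradient descent loop. The function name, the learning rate, the step count, and the toy objective are illustrative assumptions, not anything shown in the session.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        # If the slope is positive we back up; if negative we keep going.
        x -= learning_rate * gradient(x)
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
# The starting point is deliberately far from the minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=-10.0)
```

The starting point does not have to be good; each step only needs the local slope to know which way to move.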

782
01:06:46,840 --> 01:06:50,840
OK, so enough of that nonsense.

783
01:06:50,840 --> 01:07:00,560
OK, real quick, just to show you some cool statistics and then let you maybe go explore

784
01:07:00,560 --> 01:07:02,680
the data on your own.

785
01:07:02,680 --> 01:07:10,560
So we said, OK, we've got a whole bunch of reviews and I'll just say, oh, let's just

786
01:07:10,560 --> 01:07:15,560
look at this first one here.

787
01:07:15,560 --> 01:07:18,880
And basically it just would be a block of text.

788
01:07:18,880 --> 01:07:25,720
And so given a block of text, can we rank this 0 to 1?

789
01:07:25,720 --> 01:07:30,720
And so if you're particularly interested in this, we actually did a whole talk on sentiment

790
01:07:30,720 --> 01:07:31,720
analysis.

791
01:07:31,720 --> 01:07:33,120
I'm not sure if it's uploaded yet.

792
01:07:33,120 --> 01:07:37,280
If not, then hopefully soon.

793
01:07:37,280 --> 01:07:46,880
Just kind of want to put up these last couple of visualizations.

794
01:07:46,880 --> 01:07:52,080
We did cover a lot of ground with the optical character recognition, so we may want to see

795
01:07:52,080 --> 01:07:54,720
some of these consumer recommendations.

796
01:07:54,720 --> 01:07:59,320
But I've been droning on for a while as these figures are rendering.

797
01:07:59,320 --> 01:08:04,680
Does anyone have any questions, comments or thoughts about some of the material we covered

798
01:08:04,680 --> 01:08:12,800
today?

799
01:08:12,800 --> 01:08:28,200
Basically here's the review and you could basically get a ranking of this review.

800
01:08:28,200 --> 01:08:33,480
And that would be, I think, minus 1 to 1.

801
01:08:33,480 --> 01:08:41,360
So this should be a positive review, basically 1 plus.

802
01:08:41,360 --> 01:08:50,120
So that would be on a scale of 0 to 1.

803
01:08:50,120 --> 01:08:55,160
This review we would give a 0.65.

804
01:08:55,160 --> 01:09:05,120
So not the most favored cannabis product out there, but not the least.
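
Editor's note: the speaker mentions a score from minus 1 to 1 and then a rating on a scale of 0 to 1. One way the two could be connected is a linear rescale, sketched below. That mapping is an assumption, as is the function name; the transcript does not say which sentiment tool or formula was used.

```python
def rescale_sentiment(score):
    """Map a sentiment score from [-1, 1] onto [0, 1] linearly."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must be between -1 and 1")
    return (score + 1.0) / 2.0


# A mildly positive raw score of 0.30 lands a bit above the midpoint.
rating = rescale_sentiment(0.30)
```

Under this mapping a raw score of 0.30 works out to 0.65, a fully negative review maps to 0, and a neutral one maps to 0.5.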

805
01:09:05,120 --> 01:09:10,280
And so the idea is, OK, you can kind of combine these.

806
01:09:10,280 --> 01:09:18,040
So ask your consumers how they like the product and you could just ask them to write a free-form

807
01:09:18,040 --> 01:09:19,040
review.

808
01:09:19,040 --> 01:09:22,320
And so that's what the cannabis diary is doing.

809
01:09:22,320 --> 01:09:27,420
They're essentially saying, hey, journal about your day.

810
01:09:27,420 --> 01:09:29,840
What did you consume today?

811
01:09:29,840 --> 01:09:32,080
Journal.

812
01:09:32,080 --> 01:09:33,080
Just journal.

813
01:09:33,080 --> 01:09:40,120
And so then we could actually do the sentiment analysis where, OK, here's your block of

814
01:09:40,120 --> 01:09:41,120
text.

815
01:09:41,120 --> 01:09:43,280
Was it a good day, bad day?

816
01:09:43,280 --> 01:09:51,040
And then you could correlate this to the actual, right?

817
01:09:51,040 --> 01:10:03,520
So for example, this review right here has, this has chemicals related to it.

818
01:10:03,520 --> 01:10:22,040
So this has 0.11% beta-pinene, so on and so forth, if you're interested in the THC.

819
01:10:22,040 --> 01:10:27,200
So that one's primarily delta 9 THC.

820
01:10:27,200 --> 01:10:31,640
That could just be how the lab measured that one.

821
01:10:31,640 --> 01:10:40,000
But anywho, let's see if this runs into an error here.

822
01:10:40,000 --> 01:10:45,160
If it does, we may need to save this.

823
01:10:45,160 --> 01:10:48,480
We may need to save the product recommendations for next week.

824
01:10:48,480 --> 01:10:49,480
My apologies.

825
01:10:49,480 --> 01:10:54,120
Just still have one last error to sort out.

826
01:10:54,120 --> 01:10:59,840
So it's been a long time coming, so bear with me for the setup.

827
01:10:59,840 --> 01:11:15,560
But that's sort of the idea in a nutshell is, you know, now that we have all this data

828
01:11:15,560 --> 01:11:27,200
here, so you have all of the cannabinoid concentrations and you have somebody's review, well, you

829
01:11:27,200 --> 01:11:37,960
know, now we have all of the pieces of the puzzle that we need to do product recommendations

830
01:11:37,960 --> 01:11:42,000
incorporating the consumer sentiment of each product.

831
01:11:42,000 --> 01:11:50,800
And so this, I think, would be a pro-consumer algorithm, right?

832
01:11:50,800 --> 01:11:55,760
Algorithms have just been getting a bad rap in the news lately.

833
01:11:55,760 --> 01:12:00,840
But you know, it's just a word, right?

834
01:12:00,840 --> 01:12:03,880
Algorithm, it's just a set of instructions.

835
01:12:03,880 --> 01:12:06,680
That's what it is in my book.

836
01:12:06,680 --> 01:12:14,640
And so this is a set of instructions that I think would benefit consumers, right?

837
01:12:14,640 --> 01:12:20,680
And I think this incorporates a critical piece that a cannabis data science member brought

838
01:12:20,680 --> 01:12:32,160
up in that not only do you just want to recommend products to consumers based on their history, but you

839
01:12:32,160 --> 01:12:40,600
also want to base it off of how well they reacted to each product.

840
01:12:40,600 --> 01:12:45,120
So just because they bought a product doesn't necessarily mean you want to recommend that

841
01:12:45,120 --> 01:12:46,120
product again.

842
01:12:46,120 --> 01:12:54,360
And in fact, if they didn't like that product, you definitely don't want to recommend that

843
01:12:54,360 --> 01:12:55,360
one again.

844
01:12:55,360 --> 01:13:02,280
So if they truly got a zero for a product, that wouldn't even be included in your algorithm.
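
Editor's note: a sentiment-weighted recommendation along these lines might look like the sketch below. It is only an illustration under stated assumptions: each product is summarized by a single scalar (a log terpene ratio), sentiment is on a 0 to 1 scale, and the function names, strain names, and numbers are all hypothetical.

```python
def weighted_preference(history):
    """Sentiment-weighted average of the log terpene ratios a consumer tried.

    `history` is a list of (log_ratio, sentiment) pairs, sentiment in [0, 1].
    A product with sentiment 0 carries no weight, so it drops out entirely.
    """
    total_weight = sum(sentiment for _, sentiment in history)
    if total_weight == 0:
        return None  # no liked products, nothing to base a recommendation on
    weighted_sum = sum(ratio * sentiment for ratio, sentiment in history)
    return weighted_sum / total_weight


def recommend(history, catalog):
    """Return the catalog product whose log ratio is closest to the target."""
    target = weighted_preference(history)
    if target is None or not catalog:
        return None
    return min(catalog, key=lambda name: abs(catalog[name] - target))


# Hypothetical consumer history: the second product was rated 0.
history = [(-1.4, 0.9), (0.2, 0.0), (-1.0, 0.6)]
# Hypothetical catalog: product name -> log terpene ratio.
catalog = {"Strain X": -1.3, "Strain Y": 0.1, "Strain Z": -0.4}
pick = recommend(history, catalog)
```

Because the weights are scalars, the average stays a single number per consumer; extending this to a full terpene profile would mean averaging each component the same way.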

845
01:13:02,280 --> 01:13:12,560
So the idea is theoretically, right, they should just keep recommending consumers products

846
01:13:12,560 --> 01:13:20,280
that they like more and more and more until I guess it would stabilize, right?

847
01:13:20,280 --> 01:13:25,880
So ideally, right, you'd hit your long term X bar.

848
01:13:25,880 --> 01:13:28,680
That's just your favorite strain, right?

849
01:13:28,680 --> 01:13:30,680
You finally figured it out, right?

850
01:13:30,680 --> 01:13:35,360
So you just tried a bunch of different strains in the store, wrote your review, got your

851
01:13:35,360 --> 01:13:37,740
product recommendation for the next one.

852
01:13:37,740 --> 01:13:41,200
And then you sort of gradually find the one you like the best.

853
01:13:41,200 --> 01:13:44,000
And then there you go.

854
01:13:44,000 --> 01:13:48,280
Then that would be your maxima.

855
01:13:48,280 --> 01:13:52,560
So we're basically using that gradient descent, right?

856
01:13:52,560 --> 01:13:56,240
So then you reach your optimal point.

857
01:13:56,240 --> 01:13:58,840
So yeah, I just kind of thought about that now.

858
01:13:58,840 --> 01:14:06,360
But yeah, I think this would help consumers find their optimal product in the long run.

859
01:14:06,360 --> 01:14:11,760
But I'll have to think more about that.

860
01:14:11,760 --> 01:14:18,320
I would like to make two points as we close.

861
01:14:18,320 --> 01:14:27,160
There's one gaping hole still that we certainly at CESC are trying to fill.

862
01:14:27,160 --> 01:14:32,480
And we have an app that has been launched in proof of concept.

863
01:14:32,480 --> 01:14:38,120
And we're currently seeking funding to release version two.

864
01:14:38,120 --> 01:14:39,600
It's called the dosing project.

865
01:14:39,600 --> 01:14:42,400
And it's called dosing project for a reason.

866
01:14:42,400 --> 01:14:45,880
Because we're trying to come up with dose and amounts that are used.

867
01:14:45,880 --> 01:14:52,240
And unless you specifically query or figure out a mechanism to get that into this data

868
01:14:52,240 --> 01:14:56,480
series, you are missing a crucial piece of information.

869
01:14:56,480 --> 01:15:02,640
Maybe this was too strong and they overdosed or maybe they underdosed or whatever.

870
01:15:02,640 --> 01:15:09,200
So absent the dosing piece, you've got a hole.

871
01:15:09,200 --> 01:15:12,640
We're certainly working to try to fill that.

872
01:15:12,640 --> 01:15:24,100
The other major point I might make is in all of this, there's an inherent survey bias.

873
01:15:24,100 --> 01:15:31,280
If you don't like a product, you're not necessarily going to take the time to review it.

874
01:15:31,280 --> 01:15:39,040
Also it is possible we are seeing hints of this in the data that we analyze that certain

875
01:15:39,040 --> 01:15:42,200
strains may be associated with couch lock.

876
01:15:42,200 --> 01:15:45,600
You're not going to write a review because you're too couch locked.

877
01:15:45,600 --> 01:15:51,580
So interestingly, there may be a strain bias that causes this too.

878
01:15:51,580 --> 01:16:01,160
So one of the fixes may be not to have people rate, but to look for repeat purchase.

879
01:16:01,160 --> 01:16:07,560
And especially if in your COAs, you put unit coding into it.

880
01:16:07,560 --> 01:16:15,720
In other words, it's not just the lot numbers, not just what the batch is, but there's an

881
01:16:15,720 --> 01:16:20,440
actual merchandise ID sequential code.

882
01:16:20,440 --> 01:16:25,400
And we've done this at one point on a test launch of a product a couple of years ago.

883
01:16:25,400 --> 01:16:30,400
We put actual batch, actual item number counting.

884
01:16:30,400 --> 01:16:36,500
So if you're reporting or you review something or you put it in your diary or wherever you

885
01:16:36,500 --> 01:16:43,920
report it, but if you're looking for a repeat measure with a different lot number, that's

886
01:16:43,920 --> 01:16:49,720
a pretty good clue or a different item number.

887
01:16:49,720 --> 01:16:55,360
It's a pretty good clue that you're in favor of this product because you bought it again

888
01:16:55,360 --> 01:16:57,240
or you got it again.

889
01:16:57,240 --> 01:17:03,080
And that may, at the end of the day, be the most potent indicator of product recommendation.

890
01:17:03,080 --> 01:17:06,000
Can I jump in?

891
01:17:06,000 --> 01:17:11,480
Because you basically have raised three good points and I'll address them in reverse order.

892
01:17:11,480 --> 01:17:13,440
That's how my memory works.

893
01:17:13,440 --> 01:17:20,280
The last point you brought up is basically what I was talking about: this is what economists

894
01:17:20,280 --> 01:17:25,680
like to talk about at lunch or statisticians or data scientists is what variables should

895
01:17:25,680 --> 01:17:26,680
you use?

896
01:17:26,680 --> 01:17:32,360
And so essentially I would say, okay, repeat purchase, that's a variable.

897
01:17:32,360 --> 01:17:36,880
So you could almost use that as a dummy variable, zero, one.

898
01:17:36,880 --> 01:17:40,880
If somebody's purchased it before, it gets a value of one.

899
01:17:40,880 --> 01:17:45,200
If they've never purchased it before, you get a value of zero.

900
01:17:45,200 --> 01:17:47,120
Include that variable in your model.

901
01:17:47,120 --> 01:17:54,240
And as you said, it may be quite an effective predictor.
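
Editor's note: the repeat-purchase dummy variable described here could be built as in the sketch below. The function name, the record layout, and the sample purchases are hypothetical, and the purchases are assumed to be in chronological order.

```python
def add_repeat_purchase_flag(purchases):
    """Given chronologically ordered (consumer, product) purchases, return
    rows with a 0/1 `repeat_purchase` dummy variable."""
    seen = set()
    rows = []
    for consumer, product in purchases:
        key = (consumer, product)
        rows.append({
            "consumer": consumer,
            "product": product,
            # 1 if this consumer bought this product before, else 0.
            "repeat_purchase": 1 if key in seen else 0,
        })
        seen.add(key)
    return rows


purchases = [
    ("ann", "Strain X"),
    ("bob", "Strain X"),
    ("ann", "Strain Y"),
    ("ann", "Strain X"),  # ann comes back for Strain X
]
rows = add_repeat_purchase_flag(purchases)
flags = [row["repeat_purchase"] for row in rows]
```

The resulting column can sit next to the lab results and sentiment scores as one more explanatory variable in whatever model is being fit.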

902
01:17:54,240 --> 01:17:58,560
Like I said, it can enter your model in more sophisticated ways than that.

903
01:17:58,560 --> 01:18:04,680
Maybe it enters in the averaging somehow.

904
01:18:04,680 --> 01:18:09,440
So that was that one point, which was brilliant.

905
01:18:09,440 --> 01:18:15,640
The second point, and this is me just kind of defending the work, is that the reviews we have are quite

906
01:18:15,640 --> 01:18:16,640
flawed.

907
01:18:16,640 --> 01:18:24,080
And as you said, we actually encounter sampling bias because not everybody loves to review.

908
01:18:24,080 --> 01:18:29,120
And so that's where that kind of hits on the third point is that's really where the value

909
01:18:29,120 --> 01:18:32,080
added comes from the dosing project.

910
01:18:32,080 --> 01:18:42,200
I can't even disagree, maybe an app you're working on is marrying the product data with

911
01:18:42,200 --> 01:18:47,120
the test results and the review too.

912
01:18:47,120 --> 01:18:50,200
And so, John, you're doing that in a clinical study.

913
01:18:50,200 --> 01:18:53,600
And so that way you're getting everybody.

914
01:18:53,600 --> 01:18:58,480
So that way you don't have sampling bias.

915
01:18:58,480 --> 01:19:06,920
So you're solving problem two with number one, which is what you're doing.

916
01:19:06,920 --> 01:19:11,680
So that gets to the third point, your first point, which was the dosing project.

917
01:19:11,680 --> 01:19:15,520
And that's where I think you're adding a lot of value.

918
01:19:15,520 --> 01:19:25,440
You're marrying, or contextualizing, if you want the more formal word, contextualizing

919
01:19:25,440 --> 01:19:33,840
the product with the lab result with the consumer's outcome, so to speak.

920
01:19:33,840 --> 01:19:44,280
So their effects, sentiment, well-being, utility, whatever word, enjoyment, whatever word you

921
01:19:44,280 --> 01:19:48,660
really want to use or what you're measuring.

922
01:19:48,660 --> 01:19:55,720
Maybe this is a medical patient and they've got a very quantifiable outcome.

923
01:19:55,720 --> 01:19:58,920
So as I said, the sky's the limit.

924
01:19:58,920 --> 01:20:00,680
And that's what's fun about this.

925
01:20:00,680 --> 01:20:03,440
We're just unlocking data.

926
01:20:03,440 --> 01:20:06,240
And you're just using tools, right?

927
01:20:06,240 --> 01:20:10,480
So that's why I was saying everything came together.

928
01:20:10,480 --> 01:20:14,760
Next week I'll finally wrap it up and actually give you some product recommendations.

929
01:20:14,760 --> 01:20:20,000
So maybe we'll just do that first thing next week and not do a bunch of fancy stuff.

930
01:20:20,000 --> 01:20:23,720
But now we have all the fancy stuff in our tool belt, right?

931
01:20:23,720 --> 01:20:28,280
If any of you need help with the optical character recognition, reach out to me because this

932
01:20:28,280 --> 01:20:34,520
is something that you should be able to get installed on your computer and just be able

933
01:20:34,520 --> 01:20:42,000
to regularly use, which is awesome because who knows what cool uses you'll think of for

934
01:20:42,000 --> 01:20:43,000
it.

935
01:20:43,000 --> 01:20:49,560
And then we just did our demonstration of how you can wrangle this data in the cannabis

936
01:20:49,560 --> 01:20:51,600
industry.

937
01:20:51,600 --> 01:20:56,680
So I'll let all of your minds chew on this.

938
01:20:56,680 --> 01:20:59,200
I know I've taken up a lot of your time.

939
01:20:59,200 --> 01:21:06,080
So I just want to thank you all for coming because it's you and your ears that help advance

940
01:21:06,080 --> 01:21:07,200
all of this, right?

941
01:21:07,200 --> 01:21:13,000
So it's us bouncing ideas off of each other, working with each other.

942
01:21:13,000 --> 01:21:18,960
So I want to give a big thanks to Candice who really tested all of this data parsing.

943
01:21:18,960 --> 01:21:23,000
So the reason we have all this fun data to look at today is because of Candice.

944
01:21:23,000 --> 01:21:25,040
So big thanks to Candice.

945
01:21:25,040 --> 01:21:26,040
Thank you, Candice.

946
01:21:26,040 --> 01:21:31,640
And John, too, for really getting this ball rolling.

947
01:21:31,640 --> 01:21:37,360
Because actually, right, I'm taking the data now and I'm going to start looking at the

948
01:21:37,360 --> 01:21:42,400
ratios between the limonene and the beta-pinene.

949
01:21:42,400 --> 01:21:43,920
And it's all really great.

950
01:21:43,920 --> 01:21:46,160
It's very exciting what Cannlytics is doing.

951
01:21:46,160 --> 01:21:48,240
Be careful on that.

952
01:21:48,240 --> 01:21:53,120
You're talking about for Raw Garden?

953
01:21:53,120 --> 01:21:54,120
Yes.

954
01:21:54,120 --> 01:21:55,120
Yeah.

955
01:21:55,120 --> 01:21:59,840
Be careful because that comprises a number of different product types.

956
01:21:59,840 --> 01:22:01,360
And it's a mishmash.

957
01:22:01,360 --> 01:22:07,080
We have to subset by product to make sense out of this.

958
01:22:07,080 --> 01:22:11,280
Don't go overboard on the analysis of that until we've done that.
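
Editor's note: subsetting by product type before comparing ratios could be sketched like this. The field names, product types, and values are hypothetical and only illustrate the grouping step being asked for.

```python
import math
from collections import defaultdict


def log_ratios_by_product_type(results):
    """Group log(beta_pinene / limonene) values by product type, skipping
    rows where either terpene is missing or not detected."""
    groups = defaultdict(list)
    for row in results:
        pinene = row.get("beta_pinene")
        limonene = row.get("limonene")
        if not pinene or not limonene or pinene < 0 or limonene < 0:
            continue
        groups[row["product_type"]].append(math.log(pinene / limonene))
    return dict(groups)


# Hypothetical parsed COA results, in percent by weight.
results = [
    {"product_type": "flower", "beta_pinene": 0.05, "limonene": 0.20},
    {"product_type": "concentrate", "beta_pinene": 0.40, "limonene": 2.10},
    {"product_type": "concentrate", "beta_pinene": 0.35, "limonene": 1.60},
    {"product_type": "flower", "beta_pinene": 0.00, "limonene": 0.15},
]
grouped = log_ratios_by_product_type(results)
```

Plotting or summarizing each group separately keeps flower and concentrate results from being mixed into one distribution.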

959
01:22:11,280 --> 01:22:12,800
Oh, right.

960
01:22:12,800 --> 01:22:18,760
I was just putting it into a data frame and then selecting the columns and then x, y,

961
01:22:18,760 --> 01:22:20,240
and then plot.

962
01:22:20,240 --> 01:22:21,720
That's all I was kind of doing.

963
01:22:21,720 --> 01:22:22,720
Yeah.

964
01:22:22,720 --> 01:22:27,020
Unless we have subtypes, it's kind of meaningless.

965
01:22:27,020 --> 01:22:31,880
It was more for a reality check just so that we could check all this new data and all this

966
01:22:31,880 --> 01:22:35,160
OCR parsed PDFs.

967
01:22:35,160 --> 01:22:37,520
It's just crazy amazing.

968
01:22:37,520 --> 01:22:39,520
And so we have so much more data.

969
01:22:39,520 --> 01:22:47,040
So I was just going to rerun the same plots, just ratio, not even using like KNN or anything

970
01:22:47,040 --> 01:22:49,600
fancy.

971
01:22:49,600 --> 01:22:56,360
May I encourage you to use the beta-caryophyllene to alpha-humulene ratio instead of the beta-pinene to limonene

972
01:22:56,360 --> 01:22:57,360
ratio?

973
01:22:57,360 --> 01:23:00,000
Oh, I'm on the hook doing both.

974
01:23:00,000 --> 01:23:01,000
Okay.

975
01:23:01,000 --> 01:23:02,000
Excellent.

976
01:23:02,000 --> 01:23:03,000
Because I think that's...

977
01:23:03,000 --> 01:23:06,000
You've got the slack today.

978
01:23:06,000 --> 01:23:07,000
Okay.

979
01:23:07,000 --> 01:23:16,080
And just in general, are you guys comfortable now with the Tesseract and PNG pairing?

980
01:23:16,080 --> 01:23:23,280
That's interesting news for me, if that's the case.

981
01:23:23,280 --> 01:23:26,280
It's pretty well ironed out.

982
01:23:26,280 --> 01:23:29,720
Like I said, we haven't tried any images in the wild.

983
01:23:29,720 --> 01:23:31,280
So that's the next step is...

984
01:23:31,280 --> 01:23:37,280
Images in the wild are not going to be all that current because most people don't get

985
01:23:37,280 --> 01:23:39,960
access to paper copies of this.

986
01:23:39,960 --> 01:23:44,080
Well, ask your retailer next time you're there.

987
01:23:44,080 --> 01:23:48,840
And then like I said, it's good that you're doing a reality check, Candice, just making

988
01:23:48,840 --> 01:23:51,240
sure we've got numbers in that.

989
01:23:51,240 --> 01:23:53,840
I think the image is the more important one.

990
01:23:53,840 --> 01:23:58,920
Well, actually Keegan, it was your suggestion to look for outliers.

991
01:23:58,920 --> 01:23:59,920
True.

992
01:23:59,920 --> 01:24:03,080
Give yourself a hand.

993
01:24:03,080 --> 01:24:06,080
It is important.

994
01:24:06,080 --> 01:24:11,520
And in fact, I'll leave you with two bits of nuggets.

995
01:24:11,520 --> 01:24:16,000
The first is I'll make the code available online so that way you can continue exploring

996
01:24:16,000 --> 01:24:20,780
and maybe fix this product recommendation and beat me to the punch.

997
01:24:20,780 --> 01:24:25,680
And then the second is, and I know I've gone on about this on Saturday Morning Statistics, that outliers

998
01:24:25,680 --> 01:24:31,400
are surprisingly interesting.

999
01:24:31,400 --> 01:24:34,320
We talked about those with patents.

1000
01:24:34,320 --> 01:24:39,360
So a lot of times you may exclude outliers, but if you're doing a patent hunt, then you

1001
01:24:39,360 --> 01:24:43,520
want to find the outliers.

1002
01:24:43,520 --> 01:24:46,380
So they're always worth looking at.

1003
01:24:46,380 --> 01:24:48,840
Maybe there's something going on there.

1004
01:24:48,840 --> 01:24:54,320
So anywho, we've taken up a lot of time, so I definitely kind of want to get you all out

1005
01:24:54,320 --> 01:24:55,320
of here today.

1006
01:24:55,320 --> 01:25:00,200
I want to be respectful of your time, but I love the enthusiasm.

1007
01:25:00,200 --> 01:25:04,960
So please keep it up and I'll share all the code with you so that way you can go get your

1008
01:25:04,960 --> 01:25:15,800
hands on it all and explore, and we can reconvene next week and we'll at least do the consumer

1009
01:25:15,800 --> 01:25:21,200
recommendations and then we'll have time to do some sort of fun other data science exercise.

1010
01:25:21,200 --> 01:25:24,080
So please let me know if you want something.

1011
01:25:24,080 --> 01:25:27,040
Otherwise, I'll think of the funnest thing I can.

1012
01:25:27,040 --> 01:25:49,640
And we can have a fun time crunching numbers again.

