1
00:00:00,000 --> 00:00:12,200
Welcome to cannabis data science. You're in for a particular treat today. You know, as

2
00:00:12,200 --> 00:00:20,960
we've been at this for a long time now, just slow and methodically collecting data, standardizing

3
00:00:20,960 --> 00:00:30,320
it, looking at it, asking interesting research questions, trying to draw insights. And we try

4
00:00:30,320 --> 00:00:37,120
to follow the demand. We try to listen, keep our ears to the ground, hear what statistics are

5
00:00:37,120 --> 00:00:43,240
people demanding, and then try to put those statistics and data into people's hands because

6
00:00:43,240 --> 00:00:49,400
we think that's our comparative advantage and the way that we can help people out in the cannabis

7
00:00:49,400 --> 00:00:59,040
space. I'll share with you a lot of new ground, a lot of work that can be done, some exciting new

8
00:00:59,040 --> 00:01:04,240
developments, and of course, we'll just have a fun day crunching cannabis data at the end of the day.

9
00:01:04,240 --> 00:01:09,960
So that's what I have to share with you today. We'd love to hear about some of your adventures.

10
00:01:09,960 --> 00:01:16,880
It's a meetup after all, so you know, feel free to share some of your interests. So I'll start,

11
00:01:16,880 --> 00:01:24,760
so I don't put anyone on the spot. I'll start with my intrepid co-host, Candice, who's been

12
00:01:24,760 --> 00:01:32,120
helping me wrangle data for a long time now. She's got an impressive technical setup. So I'll pass

13
00:01:32,120 --> 00:01:37,960
it off to you, Candice, anything you'd like to share? Any of your thoughts you want to put on

14
00:01:37,960 --> 00:01:46,840
the table? Well, it's been a fun week. I ran Get Results or Get Licenses for Florida. That was

15
00:01:46,840 --> 00:01:54,400
pretty cool. Came up with real nice CSV. Thank you, Keegan. And also to Get Results Florida,

16
00:01:54,400 --> 00:02:04,720
I'm getting those as well. I have 17.3 gigabyte of COAPDS, and actually, I made some changes in the

17
00:02:04,720 --> 00:02:11,960
code that where I resulted with Cannot Save File into a Non-Existent Directory Parent, but I know

18
00:02:11,960 --> 00:02:20,840
where that is. And otherwise, yeah, it's, you know, parsing and putting into a results data set. And

19
00:02:20,840 --> 00:02:27,640
I'm really excited because it's a lot of information for Florida. And I have been trying to obtain it by

20
00:02:27,640 --> 00:02:35,440
purchasing product and, you know, with Facebook group medical patients in Florida that were

21
00:02:35,440 --> 00:02:44,640
sharing COAs. So I'm really excited about this. And that and the metric API, I'm getting closer to

22
00:02:44,640 --> 00:02:52,120
that. I'm just waiting now. I filled out everything, the agreement. And as far as the Massachusetts

23
00:02:52,120 --> 00:02:59,320
Freedom of Information Act, I am going to pursue it. You know, because, you know, like, CCC does

24
00:02:59,320 --> 00:03:05,200
have the records, they have it in a database, and all they have to do is just publish it to a CSV.

25
00:03:05,200 --> 00:03:11,800
file instead of printing it out. But they're saying that they don't have the records. And I don't

26
00:03:11,800 --> 00:03:18,160
know, you know, I, I could be mistaken, right. But I just want to ask them, you know, about that. I

27
00:03:18,160 --> 00:03:24,520
really don't understand why it can't be done, or it can't be done by other states. But that's it.

28
00:03:25,920 --> 00:03:28,040
And hi, everybody.

29
00:03:28,040 --> 00:03:38,120
I absolutely love it. Because that's sort of the lesson of the day is ask and you shall receive.

30
00:03:38,120 --> 00:03:50,480
And just the act of asking, as we've discussed in the past, and I'll drill home today, adds value,

31
00:03:51,040 --> 00:03:56,880
right. And I always say, right, one of the lessons in the past was, you know, it doesn't hurt to ask.

32
00:03:56,880 --> 00:04:03,560
So, as Candice was saying, you know, she's at least asking for lab results for some of these cannabis

33
00:04:03,560 --> 00:04:11,400
products in her state, because those are data points that she's interested in. And Candice isn't

34
00:04:11,400 --> 00:04:18,840
alone. There's a lot of people out there searching for this data. I'll actually piggyback on a lot of

35
00:04:18,840 --> 00:04:26,920
the things Candice said momentarily. But I'll have a lot to say on that. So while I'm, you know,

36
00:04:26,920 --> 00:04:32,680
formalizing my thoughts, I'll go ahead and let everybody have a chance to speak. So, it's

37
00:04:32,680 --> 00:04:41,320
phenomenal to see you. You've been on my mind as we've got some new data science algorithms to

38
00:04:41,320 --> 00:04:49,120
write, and Canlytics is going to be looking for some of its first hires in the near future. So,

39
00:04:49,120 --> 00:04:57,640
I'm going to be getting everybody formally contacted once I get all the job postings listed. As I will

40
00:04:57,640 --> 00:05:08,800
share with you today, I've been sprinting on something interesting. I can go back to just a nice

41
00:05:08,800 --> 00:05:20,480
methodical jog at this point, and try to get some of you awesome people involved. So, but what's on

42
00:05:20,480 --> 00:05:26,240
your mind or plate or any thoughts you want to put on the table? I'm doing a data science boot camp

43
00:05:26,240 --> 00:05:32,040
right now. Well, hi, my name is Hector. I sporadically show up in these meetings every once in a while.

44
00:05:32,040 --> 00:05:37,960
I'm doing a data science boot camp. And right now we're on break, we just finished our second capstone.

45
00:05:37,960 --> 00:05:53,440
And I did a project on identifying candidates who were going to go bankrupt. And so I ran several

46
00:05:53,440 --> 00:06:03,440
machine learning algorithms, including random forest and XGBoost. And I was able to get some

47
00:06:03,440 --> 00:06:12,640
pretty good results from running my models. I ran PCA, I did a whole bunch of stuff. It was a lot

48
00:06:12,640 --> 00:06:20,480
of fun. I look forward to being able to use my skills even further in the future. Well, in the

49
00:06:20,480 --> 00:06:27,960
past, remember, we poked around at survival analysis. I think we were looking at Washington

50
00:06:27,960 --> 00:06:35,120
State, but we've got traceability data, and even lab results, it contains a surprising amount of

51
00:06:35,120 --> 00:06:44,640
information. And one thing is just who's operating. So just from say lab results alone, you can kind of

52
00:06:44,640 --> 00:06:53,200
get a pulse of who's going to the lab, how many samples are they sending in. But in Washington

53
00:06:53,200 --> 00:06:59,520
State, we've got full traceability data, so we can even look at sales. And so we were saying, Oh, you

54
00:06:59,520 --> 00:07:04,920
know, we're, we're trying to look at people's sales over time. And we noticed people would drop out of

55
00:07:04,920 --> 00:07:10,840
the market. And of course, people would enter the market. So what you could potentially do is maybe

56
00:07:10,840 --> 00:07:18,280
you could adapt your bankruptcy model to the Washington State data, and almost have a risk

57
00:07:18,280 --> 00:07:25,400
assessment model, or slash survival model. So maybe you could apply it to licensees there. And yeah,

58
00:07:25,400 --> 00:07:34,360
maybe some of the licensees, they're going along steady, they're at low risk of bankruptcy, or, or

59
00:07:34,360 --> 00:07:42,000
falling out of the market. But maybe you can identify people who are at high risk of exiting,

60
00:07:42,000 --> 00:07:49,600
and they may not even realize it. So you may be able to send them an email and just say, Hey, some

61
00:07:49,600 --> 00:07:57,160
of your key performance indicators aren't looking that great, or I've got this risk assessment model,

62
00:07:57,160 --> 00:08:04,400
and you know, maybe you were flagged. And, you know, maybe we could, you know, work together to try to

63
00:08:04,400 --> 00:08:15,840
figure out some ways to, to, to lower your risk of exit. So, but I'm just brainstorming, I'm sure you

64
00:08:15,840 --> 00:08:25,880
can think of, you know, many more uses or even a better, better application than that. But I love

65
00:08:25,880 --> 00:08:30,400
that you're, you're keeping your nose to the grindstone and working on your skills. That's important.

66
00:08:30,400 --> 00:08:43,760
So, Robert, good to see you. I think we were corresponding about some, some work to be done. And

67
00:08:43,760 --> 00:08:50,080
I've started to share some of the, the data collection efforts, slash standardization efforts,

68
00:08:50,080 --> 00:08:55,880
even the AI efforts that we're working on in Slack, and so on and so forth. So please, please,

69
00:08:55,880 --> 00:09:01,160
everyone get in touch with me if you need an invite to the Slack channel or anything. But how

70
00:09:01,160 --> 00:09:10,400
is your work going, Robert? Good, yes, similar to Hector. I'm also in a data science boot camp and

71
00:09:10,400 --> 00:09:18,320
finished up Capstone 2 projects where I did something somewhat similar. And that was looking

72
00:09:18,320 --> 00:09:30,560
at a credit card, credit scoring classification model where I used about six different classification

73
00:09:30,560 --> 00:09:40,520
algorithms. And then before that, I also had a fair amount of data scrubbing and standardizing

74
00:09:40,520 --> 00:09:49,520
to do. And, you know, most, or in the past, as a SQL developer, I would, you know, not be crazy

75
00:09:49,520 --> 00:09:56,360
about the idea of data scrubbing, but it was a good chance to practice Python. So I was pretty

76
00:09:56,360 --> 00:10:06,720
happy or totally fine with running a bunch of different Python scripts to get clean CSV files

77
00:10:06,720 --> 00:10:15,240
that would be able to import into a SQL database without generating errors and so on, import errors.

78
00:10:15,240 --> 00:10:24,200
And then similarly, I'm, you know, these days, super enthused with, you know, what AI and chat

79
00:10:24,200 --> 00:10:33,640
GPT and what Bard can do. Very interesting stuff going on right now with all this AI that's out

80
00:10:33,640 --> 00:10:41,040
there. So yeah, that's me, you know, studying hard and I'll circle back to you on getting some of

81
00:10:41,040 --> 00:10:51,680
those data files and how I can assist. Well, once again, Robert, going to tie in your interests as

82
00:10:51,680 --> 00:10:59,440
well, because I always like to explore all the latest and greatest tools. I always caution people,

83
00:10:59,440 --> 00:11:05,760
you know, don't put all your eggs in, in one basket. You know, in the past, we talked about,

84
00:11:05,760 --> 00:11:11,520
you know, some of the blockchain technologies. And while I still think those are awesome

85
00:11:11,520 --> 00:11:18,880
technologies, and to a certain extent, we use little pieces of them, we use hashing. And as I

86
00:11:18,880 --> 00:11:23,280
was talking about, like there's ways that we can start incorporating more of those ideas into our

87
00:11:23,280 --> 00:11:32,320
work. But I think the slow approach is good, just slow adoption. As I said, don't just want to rush

88
00:11:32,320 --> 00:11:39,440
full steam into it. So I'm a little careful with AI in that I don't necessarily want to rush full

89
00:11:39,440 --> 00:11:46,400
steam into it. But at the same time, I think it would be negligent to not explore it. It's a

90
00:11:46,400 --> 00:11:52,720
awesome new technology, everyone's raving about it. So shouldn't we understand it fairly well?

91
00:11:53,360 --> 00:12:04,720
And that's truly, as I'll show you today, it can be interesting to help us do well estimation.

92
00:12:04,720 --> 00:12:12,480
Remember, statistics at the end of the day is about prediction. So today, I'll show you how we can

93
00:12:12,480 --> 00:12:20,400
use the AI model for predictive purposes. So we may not want to say that this is the definitive

94
00:12:20,400 --> 00:12:28,320
answer. But sometimes, remember, when we were doing forecasting, I was saying, you know, any

95
00:12:28,320 --> 00:12:35,760
forecast is better than no forecast. And so in this case, it will be any look at the data is better

96
00:12:35,760 --> 00:12:42,960
than no look at the data. So that's a teaser to really, really exciting things to share with you

97
00:12:42,960 --> 00:12:51,520
momentarily. But before I get to that, Rick, thrilled to see you today. Going to be picking

98
00:12:51,520 --> 00:12:57,760
up with some of the topics from last week, in particular, Dino hunting. So that's sort of the

99
00:12:57,760 --> 00:13:02,640
the theme of the day today. So we'd love to hear about some of your thoughts, or ideas that you

100
00:13:02,640 --> 00:13:09,040
want to put on the table. Yeah, I'm excited to see what you've come up with for today. It was a

101
00:13:09,040 --> 00:13:17,280
great conversation last week. So hi, everyone. My name is Rick. Right now, my main focus is AI. I've

102
00:13:17,280 --> 00:13:26,320
been working not so much with collecting number data, but more text data. This is I don't know if

103
00:13:26,320 --> 00:13:32,480
you consider it like tribal knowledge or what, but in terms of cultivating, I think it's a

104
00:13:32,480 --> 00:13:40,720
great idea to be collecting, you know, history about genetics and where they came from and trying

105
00:13:40,720 --> 00:13:48,400
to track origins of specific expressions and plants, so on and so forth. There's not a huge

106
00:13:48,400 --> 00:13:54,240
database available for collecting a lot of that there is books and stuff. So I've essentially

107
00:13:54,240 --> 00:14:00,240
just been scraping all of that data from various forums like Reddit, there's a lot of good

108
00:14:00,240 --> 00:14:07,920
things there. And then running those through an embedding process and then storing it in a vector

109
00:14:07,920 --> 00:14:15,920
database for retrieval for AI. So you just kind of throw like an AI chat bot on top of that and it

110
00:14:15,920 --> 00:14:22,960
can query the database and get really accurate responses after, you know, a minimal amount of

111
00:14:22,960 --> 00:14:29,680
fine tuning. It's got a lot of moving pieces and it's hard to keep up with because things change

112
00:14:29,680 --> 00:14:35,920
and I haven't used it for numbers yet. Gigan and I suspect it's because of a lot of your cautions

113
00:14:35,920 --> 00:14:41,440
as well. But for text and stuff like growing or helping someone that has a cultivation question,

114
00:14:41,440 --> 00:14:45,440
it's working pretty well right now. So that's what I'm working on.

115
00:14:46,960 --> 00:14:54,640
I absolutely love it, Rick, and hopefully we can add some tools to your tool belt today and, you

116
00:14:54,640 --> 00:15:01,280
know, just kind of, you know, bounce good ideas off of each other because a lot of the work we're

117
00:15:01,280 --> 00:15:13,200
doing is complementary. So absolutely love it. Excited to see, you know, how you develop

118
00:15:13,200 --> 00:15:18,240
and how your projects grow. So keep us up to date on these. So this is exciting work that you're doing.

119
00:15:18,240 --> 00:15:24,480
It's right in the vein of everything we're doing today. And it's funny that you once again bring

120
00:15:24,480 --> 00:15:31,680
up Reddit because I'll tie in some of the work you're doing today too. So simply phenomenal.

121
00:15:31,680 --> 00:15:38,240
Simply phenomenal. How about you, Anna? Welcome to the group. Well, you've been to the group. So

122
00:15:38,240 --> 00:15:43,840
good to see you today. Thrilled to see you. We'd love to hear about some of the things that are

123
00:15:43,840 --> 00:15:48,800
on your plate and any ideas that you want to put on the table. Oh my gosh. I haven't been working

124
00:15:48,800 --> 00:15:53,440
on anything exciting. I've just been doing a lot of housekeeping. So setting up the

125
00:15:54,720 --> 00:16:02,800
external hard drive and just my file system and all that. But I am finding that there's not a whole

126
00:16:02,800 --> 00:16:10,240
lot of data for Oregon to scrape off the web. And so I've still been looking around. I've

127
00:16:10,240 --> 00:16:17,200
guess, yeah, I'm still a little bit lost on that part. And I am looking for other people from

128
00:16:17,200 --> 00:16:24,240
Portland or Oregon in general. So yeah, I may be shifting off of the medical focus, even though

129
00:16:24,240 --> 00:16:29,520
that's always where I want to put my efforts. But right now I'm a little distracted with some

130
00:16:29,520 --> 00:16:36,240
of the stuff going on in the policy arena. So yeah, but I don't really have anything very

131
00:16:36,240 --> 00:16:41,760
interesting to present just puttering along still. But I just wanted to reconnect for sure. So

132
00:16:42,400 --> 00:16:48,400
thank you. I'm in good company, it looks like. Well, Anna, I think we can give you some rays of

133
00:16:48,400 --> 00:17:00,480
sunshine here because I think we can provide or you can maybe even Kindle, Spark up a way to

134
00:17:00,480 --> 00:17:09,600
actually get some of this Oregon data. And once again, it all begins, as we said, with asking.

135
00:17:09,600 --> 00:17:16,480
So I'll go ahead and share my screen here, but you're definitely on the right path. And as I

136
00:17:16,480 --> 00:17:27,120
tell everybody, you know, even if you're just able to spend 20 minutes a day, just doing some sort of

137
00:17:27,120 --> 00:17:33,440
data science thing, right, even if that's just setting up, like you said, your external hard drive,

138
00:17:33,440 --> 00:17:42,640
that's still moving the ball forward. And these external hard drives, I've gone, this is my third

139
00:17:42,640 --> 00:17:49,360
one now. So be really careful with them. And because they're really, they're actually kind of

140
00:17:49,360 --> 00:17:57,520
fragile. And so like I did, you drop it on the floor, it's, it's gone, I'm going to try to do like

141
00:17:57,520 --> 00:18:06,320
some some data recovery. Once, once again, my philosophy is, you know, archive, archive, archive.

142
00:18:06,320 --> 00:18:12,720
So you want to have data stored in multiple places. And that's why, yes, you know, it's phenomenal

143
00:18:12,720 --> 00:18:19,280
for me to have it on my external hard drive. And I think that's a great way to do that. And

144
00:18:19,280 --> 00:18:25,280
you know, if you can afford it, to put it in the cloud, you can do that. But you know, cloud prices

145
00:18:25,280 --> 00:18:37,280
can be non negligible. And then distribute, that's the our third source of resiliency. So if I drop

146
00:18:37,280 --> 00:18:45,200
my hard drive with a bunch of this lab results, you know, on the floor, like I did, that's okay,

147
00:18:45,200 --> 00:18:53,200
because all of you awesome data scientists are also archiving data. So you know, as Candice has

148
00:18:53,200 --> 00:19:00,640
mentioned, she's beginning her archive of lab results. And I would encourage all of you to, to

149
00:19:00,640 --> 00:19:06,720
archive data, because just, you know, maybe I dropped my hard drive on the floor, Candice drops

150
00:19:06,720 --> 00:19:15,200
her hard drive on the floor. But, you know, thankfully, Anna has has a copy. So, you know, you

151
00:19:15,200 --> 00:19:22,400
just never know when your copy is going to pay off or be useful to someone. And Anna, you also

152
00:19:22,400 --> 00:19:33,680
mentioned medical patients. Well, as Candice was mentioning in the beginning, in Florida, there's

153
00:19:33,680 --> 00:19:45,120
medical cannabis, and look at this. So we were telling people that, you know, it doesn't hurt to

154
00:19:45,120 --> 00:19:52,880
ask, you know, you're going to be going to a dispensary, I think, you know, that if you're

155
00:19:52,880 --> 00:19:59,920
consuming a product, it's pertinent to know what you're consuming. So it wouldn't hurt to ask for

156
00:19:59,920 --> 00:20:06,960
the certificate. I don't think in Florida, they're mandated to provide them. In some states, they are

157
00:20:06,960 --> 00:20:15,600
like Washington State. And in some states, they're maybe not mandated, but people will provide them.

158
00:20:15,600 --> 00:20:21,760
So I've seen them in Massachusetts before from just asking at a dispensary, and I don't think

159
00:20:21,760 --> 00:20:31,920
it's mandated there. But this is pretty typical, right? And as I said, don't get mad, right? So

160
00:20:34,640 --> 00:20:40,400
just because it's okay to ask doesn't necessarily mean they have to say yes. So, you know, you may

161
00:20:40,400 --> 00:20:47,360
get a no, and you have to kind of just be okay with that. So here they say, oh, you know, they said,

162
00:20:47,360 --> 00:20:55,760
oh, the printer was offline. And once again, that may be perfectly the that may actually be the case

163
00:20:55,760 --> 00:21:07,040
because printers are notoriously fickle. But this is my point is, if you go around asking, well,

164
00:21:07,040 --> 00:21:16,800
so that was three months ago. Well, eight days ago, look at this, somebody made a post, you know,

165
00:21:16,800 --> 00:21:22,880
Dita can finally put their CEO is up. And so they said they thought, oh, you know, they never thought

166
00:21:22,880 --> 00:21:30,400
it would happen. But they're like, oh, this is amazing. You know, I can now shop off of turpies.

167
00:21:32,640 --> 00:21:39,360
And in like, this is another thing is, you know, you shouldn't have to look at the COA

168
00:21:39,360 --> 00:21:46,160
after you purchased it, right? Because ideally, you want to know what you're consuming,

169
00:21:46,160 --> 00:21:56,000
you know, before you purchase it. And so I'll let you explore this. But, you know, now people

170
00:21:56,000 --> 00:22:03,680
are saying like, oh, look, you know, here, I bought this product. And they're saying, oh, you know,

171
00:22:04,480 --> 00:22:12,880
what's up with the THC values? And then, you know, people chime in, in the comments, you know, kind of,

172
00:22:12,880 --> 00:22:19,600
you know, having a conversation about it. And I think this is this is brilliant, right? Because

173
00:22:20,320 --> 00:22:27,760
this is kind of reducing some of the the murkiness that was in the market, right? People were

174
00:22:27,760 --> 00:22:33,680
kind of, you know, getting hung up on these THC numbers. But now they're actually kind of able to

175
00:22:33,680 --> 00:22:44,000
see like, oh, like, there's more to it. Then, you know, there's more to this than just total THC.

176
00:22:44,000 --> 00:22:50,160
You know, we've got all the the other cannabinoids. And oh, you know, what's going on, you know, here

177
00:22:50,160 --> 00:23:01,600
with with turpines? And sure enough, so here's somebody saying, like, oh, like, so I think I like

178
00:23:01,600 --> 00:23:08,640
terpenes, right? So and they're saying, you know, oh, you know, what's the deal with with barnesine,

179
00:23:08,640 --> 00:23:16,480
which, which is a terpene they test for in in Florida. And so once again, you know, this person

180
00:23:16,480 --> 00:23:23,440
just finally, you know, saw some COAs, and they're just asking, like, okay, like, they didn't even

181
00:23:23,440 --> 00:23:29,120
know about terpenes. And they're like, you know, can someone just please explain terpenes to me?

182
00:23:29,120 --> 00:23:35,440
And, you know, sure enough, you know, once again, you've got to be cautious about things people tell

183
00:23:35,440 --> 00:23:42,800
you, right, you want to, you know, check it for factual accuracy and everything. But this is still

184
00:23:42,800 --> 00:23:51,680
a conversation. And so that's the point is, ask a question, and it can, you know, get the conversation

185
00:23:51,680 --> 00:24:01,600
going. They asked the question about COAs got the conversation going. You know, they posted a COA,

186
00:24:02,480 --> 00:24:09,360
you know, that got the conversation going to terpenes. And then, you know, and then now, Rick,

187
00:24:09,360 --> 00:24:16,880
now this is sort of up your vein. People are now saying like, okay, you know, is there any way we

188
00:24:16,880 --> 00:24:23,440
can start, you know, accumulating some of these COAs, so that way we can kind of understand them

189
00:24:23,440 --> 00:24:35,600
as a whole. So for example, if you see beta pionene at 0.2%, is that a lot? Or is that a little? Well,

190
00:24:35,600 --> 00:24:41,280
as we've seen, if you actually look at the distribution, you can find out if something

191
00:24:41,280 --> 00:24:48,560
is, you know, above average or below average, and that's the useful data point. So that's what

192
00:24:48,560 --> 00:24:58,080
people are kind of after. And once again, the Rick, these aren't finished projects. And so this is

193
00:24:58,080 --> 00:25:05,280
what I call a demand for a product, right? So these are people saying that, oh, you know, we want

194
00:25:05,280 --> 00:25:14,880
some COAs, you know, we want a COA database. Here's somebody who's got a research question. So they say,

195
00:25:14,880 --> 00:25:23,440
you know, oh, you know, what's going on with with Farnes? And now finally, you know, somebody just

196
00:25:23,440 --> 00:25:31,680
asks the question, you know, can we just get a, you know, a universal COA search site? Well,

197
00:25:31,680 --> 00:25:43,440
remember at the beginning, ask, and you shall receive. So this is, you know, just the Reddit,

198
00:25:43,440 --> 00:25:51,360
on Reddit was just the crude origins, you know, to this, you know, very, very day, or a couple days

199
00:25:51,360 --> 00:25:59,120
ago, you know, people are just posting up their results. And, you know, you can get the the various

200
00:25:59,120 --> 00:26:08,640
you know, you can start looking at some of these COAs. And, you know, so what I realized is, oh,

201
00:26:08,640 --> 00:26:18,160
well, they have, you know, QR codes on them. So what you can do is scan the QR code on your URL,

202
00:26:18,160 --> 00:26:25,520
or I mean, scan the QR code on your phone, or with with Python, or however you please,

203
00:26:25,520 --> 00:26:35,600
and you know, you can now get this. And actually, here, we can actually do this in Python real quick.

204
00:26:35,600 --> 00:26:42,800
So to just show you how you can actually get the COA URL off of this thing. And, you know,

205
00:26:42,800 --> 00:26:50,880
you can use CANlytics. And our favorite COA parsing tool, and you can use the

206
00:26:50,880 --> 00:27:04,400
COA doc. And then we can find the COA URL over here. So I think we can do this without too much

207
00:27:04,400 --> 00:27:11,520
trouble. Hopefully, if it's too much trouble, I'll just pass it. But we can just say, oh, let's find

208
00:27:11,520 --> 00:27:19,440
the COA URL. And then we can find the COA URL over here. So I think we can do this without too much

209
00:27:19,440 --> 00:27:29,840
trouble. And then we can just say, oh, let's find this QR code. And so this just checks all the images

210
00:27:29,840 --> 00:27:39,200
and sees if any of them are QR codes, and if there's a parsable URL from them. And so once again,

211
00:27:39,200 --> 00:27:46,000
not super fast. So if any of you want to get in there, and also I'm running a bunch of processes

212
00:27:46,000 --> 00:27:58,400
now. But long story short, you know, you can start to, you know, find the direct source of these

213
00:27:59,600 --> 00:28:10,960
COAs. And basically, upon, you know, further exploration, you know, you can basically find

214
00:28:10,960 --> 00:28:24,080
that these various laboratories are publishing, I think, similar to NCR labs, where maybe certain

215
00:28:24,080 --> 00:28:32,240
clients want to make their COAs publicly available. So for example, you know, VitaCAN,

216
00:28:32,240 --> 00:28:46,640
you know, they're trying to get their COAs in the hands of consumers. And just to kind of show you,

217
00:28:48,000 --> 00:28:54,640
so for example, one of the things like Candice was working on, that I was working on, was just

218
00:28:54,640 --> 00:29:12,400
archiving all of these. So basically, let's see if we can't, here one second here, just going to see

219
00:29:12,400 --> 00:29:21,600
if we can't just pull these up real quick for you. So for example, say we wanted to look at VitaCAN

220
00:29:21,600 --> 00:29:33,360
to look at VitaCAN COAs. We can do that. Okay, let me try one more thing.

221
00:29:33,360 --> 00:29:49,760
Okay, maybe not. I think I just had an extra backslash in there. Okay, let's try one more time.

222
00:29:52,160 --> 00:30:01,840
Okay, so they just have, so remember, they just started, you know, eight days ago or so. So they

223
00:30:01,840 --> 00:30:10,240
may not have, well, actually these look like they're from earlier. Oh, so here's the latest one.

224
00:30:10,240 --> 00:30:20,160
So here's one from April 28th. So maybe they don't have all of their COAs here, but you can,

225
00:30:20,160 --> 00:30:32,800
you know, start to look at them. And once again, they've got the terpene values.

226
00:30:32,800 --> 00:30:39,680
And so, okay, so why am I jammering on about all of this?

227
00:30:39,680 --> 00:30:58,880
Well, there's a demand for COA data. So, you know, why don't we basically, you know,

228
00:30:58,880 --> 00:31:05,440
give the people what they want. So, you know, what you can do is, you know,

229
00:31:05,440 --> 00:31:16,720
you can at least find all of these COA links. So, you know, there's, we were just doing them,

230
00:31:17,280 --> 00:31:24,400
finding them through crude manners. So, you know, people, so here's someone who bought some animal

231
00:31:24,400 --> 00:31:37,520
tsunami. And sure enough, they posted their COA. You know, you can basically get the QR code,

232
00:31:38,960 --> 00:31:48,240
then get the official COA. That's my recommendation, because as you can see, people's pictures

233
00:31:48,240 --> 00:31:57,600
sometimes cut off portions of the certificate. So it's best if you can get the official URL,

234
00:31:58,640 --> 00:32:05,760
you know, as we said, you know, go straight to the source. And then, you know, and then we can

235
00:32:05,760 --> 00:32:13,600
basically work on COA parsing tools, because that's where people are around the world.

236
00:32:13,600 --> 00:32:20,080
COA parsing tools, because that's what people are worried about, right? They're saying like,

237
00:32:20,080 --> 00:32:28,800
oh, you know, what's going on with barnesine? And, you know, we've been interested in

238
00:32:28,800 --> 00:32:37,760
limonene and, you know, beta pinyin. And it doesn't really make much sense to just

239
00:32:37,760 --> 00:32:49,200
have to to plough through this COA. So, you know, why don't we basically do this with Python here?

240
00:32:50,000 --> 00:32:56,800
So I'll go ahead and show you how we can do that. But before I do that, any thoughts, comments,

241
00:32:56,800 --> 00:33:08,160
questions before we dive into this code here? Okay. Still with you all, right?

242
00:33:10,560 --> 00:33:19,840
Yep. Okay, phenomenal. So I'll try to tie this, tie this all together and bring it home. So long

243
00:33:19,840 --> 00:33:30,560
story short, you can start archiving some of these COAs. And this is just a little setup here. So,

244
00:33:30,560 --> 00:33:41,280
as I said, we're going to use OpenAIs. So what we can do, well, right, we can find this PDF.

245
00:33:41,280 --> 00:33:50,160
And so, right, so people are posting up their various PDFs here. And so this, this is just

246
00:33:50,720 --> 00:34:03,440
a sample here that somebody tested this. See, it's a goo berry extraction. And this one was tested at

247
00:34:03,440 --> 00:34:13,040
ACS laboratory. And so here's one, right, this one was at Keisha Labs. And I am working on it,

248
00:34:13,040 --> 00:34:22,880
basically a parsing algorithm for for some of these, you know, Keisha certificates, and some of

249
00:34:22,880 --> 00:34:32,320
these ACS certificates. But I started looking at some of this, some of the data, some of these

250
00:34:32,320 --> 00:34:42,000
certificates. So I'll just show you some of the PDFs here. Here, I'll try to find one that's got

251
00:34:42,000 --> 00:35:02,640
a large span of... So, right, here's, you know, like a more recent COA, a one to one CBD THC ball.

252
00:35:04,000 --> 00:35:11,040
Interesting, you know, you can, you know, you're not going to see many terpenes in this ball.

253
00:35:11,040 --> 00:35:25,200
But for some reason, they put in put in this one, a hexa, what's it say, hexa drothymol. So interesting.

254
00:35:26,480 --> 00:35:32,800
But remember, this is a ball. The important thing on this one is just the CBD and the THC

255
00:35:32,800 --> 00:35:43,600
concentrations. And but I realized, you know, not all these COAs are made the same. And, you know,

256
00:35:43,600 --> 00:35:56,400
if you go way back in time, so here's a COA from either... So this looks like a COA from 2019.

257
00:35:56,400 --> 00:36:05,920
And, you know, you don't have as much information. And in fact, you don't even have total THC or CBD

258
00:36:05,920 --> 00:36:14,320
on this one. And as you can kind of see from like the file sizes here, like over time,

259
00:36:15,760 --> 00:36:24,240
there's COA just, you know, gradually changed into, you know, what it is today. But some of the,

260
00:36:24,240 --> 00:36:34,000
you know, the earlier ones are going to be a little different, right? So here's like an early

261
00:36:34,560 --> 00:36:44,880
Caesia lab COA, where this one was in 2019. This one does have total cannabinoids on it,

262
00:36:44,880 --> 00:36:52,800
which is what we're after, at least for some of our analyses, we would like terpenes too,

263
00:36:52,800 --> 00:36:58,560
but at the minimum, we'd like to get cannabinoids. And so what I realized is, you know, it may not

264
00:36:58,560 --> 00:37:11,520
actually make sense to write an algorithm to parse a particular COA, because the formatting's always

265
00:37:11,520 --> 00:37:17,840
changing. And, you know, why is it changing? Well, because, you know, software developers like to be

266
00:37:17,840 --> 00:37:24,880
continuously improving upon their code. And, you know, the laboratories would like to be, you know,

267
00:37:25,600 --> 00:37:31,440
hopefully continuously improving their COAs, you know, to a certain extent. And so, you know,

268
00:37:31,440 --> 00:37:37,280
they maybe they looked at this and they realized, oh, you know, we could add terpenes to it. And so

269
00:37:37,280 --> 00:37:46,720
just over time, you can see, you know, more styling was done, various elements moved around on the

270
00:37:46,720 --> 00:37:55,920
pages, which is, you know, fine and dandy, but it's going to make COA parsing super difficult. But

271
00:37:56,640 --> 00:38:03,680
why should we let that stop us? Because, you know, at the end of the day, the data's there,

272
00:38:03,680 --> 00:38:11,040
so remember, you know, never throw away any data. So this is sort of the problem we're faced with.

273
00:38:12,080 --> 00:38:21,600
People want these lab results. They're there. They're right there on the web. The data is even

274
00:38:21,600 --> 00:38:29,920
on the PDF. But as we said, it's basically locked and unobtainable. Like as we said,

275
00:38:29,920 --> 00:38:37,280
the data is so close, yet so far away, right? You can get all the certificates and, you know,

276
00:38:37,280 --> 00:38:43,360
people are diligently doing that here on Reddit, right? They recognize that the data is so close,

277
00:38:43,360 --> 00:38:50,880
it's so close, let's at least get these URLs. And, you know, they did a noble thing there,

278
00:38:50,880 --> 00:38:56,160
because it pointed us in the direction of them. And, you know, we've got some interesting

279
00:38:56,160 --> 00:39:09,120
tools up our tool belt here. So without further anticipation, let's just go ahead and get into it.

280
00:39:09,120 --> 00:39:18,480
So we've got the first page here of the certificate. So here's this crew clear syringe.

281
00:39:18,480 --> 00:39:32,880
So remember, this is the ACS laboratory certificate for the true clear syringe.

282
00:39:33,760 --> 00:39:43,920
Okay, so let's keep this in mind. So we basically just have all of the text from the front page.

283
00:39:43,920 --> 00:39:53,040
Let me print this out nicely for you. This is just the raw text. So there's the true clear syringe.

284
00:39:55,840 --> 00:40:01,840
You know, here's all the cannabinoid information. Oh, look, you know, there's farinezine.

285
00:40:01,840 --> 00:40:09,200
Actually, just for fun, let's just go ahead and get farinezine. Okay, we'll just let AI deal with

286
00:40:09,200 --> 00:40:26,520
the spelling on this one. So here's the

287
00:40:26,520 --> 00:40:34,880
text. So what if we just tell you, remember, let's try an untrained chat GPT model and just say,

288
00:40:34,880 --> 00:40:45,520
hey, you know, extract as many of these data points to JSON from the following text, right,

289
00:40:45,520 --> 00:40:51,520
because that's basically the same data point that we're going to extract from the following text.

290
00:40:51,520 --> 00:40:56,720
Right, because that's basically what you could tell a human to do. It would take them a long time,

291
00:40:56,720 --> 00:41:03,760
you know, a handful of minutes, and they wouldn't love to do it. But you could tell a human, okay,

292
00:41:04,560 --> 00:41:11,440
put the product name here, put the product type here, put the producer here, put the THC CBD and

293
00:41:11,440 --> 00:41:16,480
some of these terpenes here, right. And then we would go through and we would say, okay, here's

294
00:41:16,480 --> 00:41:27,440
farinezine, they want the farinezine percent, you know, it's 4.807. I'm going to give the model a

295
00:41:27,440 --> 00:41:36,560
supplement prompt here. And I basically am saying only, or I say return only JSON and always return

296
00:41:36,560 --> 00:41:44,480
at least an empty object if no data can be found. So I'm basically saying at least a few points,

297
00:41:44,480 --> 00:41:54,480
so I'm basically saying at least return me something here. And let's just see what happens.

298
00:41:54,480 --> 00:42:02,480
So there's a lot of things that can go wrong here. There's the connection error, hopefully,

299
00:42:02,480 --> 00:42:10,480
you know, I'm streaming after all. So we may not want to expect this to be super speedy, but

300
00:42:10,480 --> 00:42:18,480
as often with proof of concepts, they can always be refined. So the idea is you want to at least

301
00:42:18,480 --> 00:42:24,480
make sure you can do something and then you can, you know, do it better and faster. And so check

302
00:42:24,480 --> 00:42:34,480
this out. So we've got the product name, there's the true clear syringe. It did not get the product

303
00:42:34,480 --> 00:42:50,480
type out of this. Unless it did, but here let's look at the COA real quick. So it didn't quite get

304
00:42:50,480 --> 00:43:00,480
the sample type correct. So it's in sample matrix. Well, I don't know, that's debatable. So we'll

305
00:43:00,480 --> 00:43:10,480
maybe have to try this out on more COAs, but the product type may have been a miss. It got the

306
00:43:10,480 --> 00:43:20,480
producer as true leave. And for some reason, they put it in all capital letters, but I'm not sure why

307
00:43:20,480 --> 00:43:32,480
it did that, but we at least have true leave right there. And check this out. Total THC, we've got 87.

308
00:43:32,480 --> 00:43:38,480
Let me actually just try to put this side by side instead of going back and forth. Okay, so we're

309
00:43:38,480 --> 00:44:00,480
looking for total THC of 80. Yep. So there's 87.66. Then we've got 0.303% CBD. Then we've got 0.099%

310
00:44:00,480 --> 00:44:16,480
beta-pinene, 0.39% d-limonene, and check it out, even got the farnazine, the 4.807. And as I said,

311
00:44:16,480 --> 00:44:24,480
it would be phenomenal to get every single last data point. And so I think we should work on that.

312
00:44:24,480 --> 00:44:34,480
But I think we tried this in some prior weeks where we just tried to get the kitchen sink, and it

313
00:44:34,480 --> 00:44:40,480
didn't quite work very well. I mean, it kind of did, but it was a little hit or miss. So this is

314
00:44:40,480 --> 00:44:48,480
the GPT-4 model. So it's a slightly better model. I think last time we may have been using GPT-3 or

315
00:44:48,480 --> 00:45:02,480
3.5. So it's a slightly better model. Maybe it's a better prompt. It's a different COA. So there's a

316
00:45:02,480 --> 00:45:10,480
lot of things that we can tinker with here. So there's, as Rick was saying at the beginning,

317
00:45:10,480 --> 00:45:17,480
there's a lot of moving pieces. This is kind of an arbitrary prompt. And so this is maybe where they

318
00:45:17,480 --> 00:45:25,480
say people talk about prompt engineering. So I'll post this code. I should have done it before the

319
00:45:25,480 --> 00:45:34,480
meetup, but I'll do it immediately after. And you can try to tinker around with this. So maybe you can

320
00:45:34,480 --> 00:45:43,480
add...that would be an experiment. Just keep adding data points until maybe it breaks. So maybe it can

321
00:45:43,480 --> 00:45:51,480
handle parsing them all. But that's something for you to experiment. And there's other parameters you

322
00:45:51,480 --> 00:45:57,480
can toggle. The temperature is basically how creative you want the model to be. And I thought we

323
00:45:57,480 --> 00:46:05,480
wanted it to not be creative. We don't want it making up answers. So that could be sort of another

324
00:46:05,480 --> 00:46:16,480
check you could do. You could maybe make a query and you can say, is this number in fact in the

325
00:46:16,480 --> 00:46:30,480
COA? So you could even say, oh, is this in the front page text? And it is. So that could potentially be

326
00:46:30,480 --> 00:46:38,480
like various checks you could do after you parse it. Because remember, these are numbers. And it's

327
00:46:38,480 --> 00:46:52,480
one thing for GPT to just turn out a paragraph that you send in an email. It's another thing for it to

328
00:46:52,480 --> 00:47:02,480
turn out cannabinoid percentages that say medical patients in Florida are going to be using to make

329
00:47:02,480 --> 00:47:14,480
purchasing decisions. So we may want to put a little bit of double checking in place to be pretty

330
00:47:14,480 --> 00:47:19,480
certain about these numbers. And in fact, what people always kind of recommend with AI models is it

331
00:47:19,480 --> 00:47:25,480
doesn't hurt to have somebody check them like a human at the end of the day. So what I was thinking

332
00:47:25,480 --> 00:47:37,480
was we can work on the COA doc user interface and basically somebody could upload their certificate

333
00:47:37,480 --> 00:47:47,480
here. We could try to parse it with COA doc or OpenAI and just try to get as many data points as

334
00:47:47,480 --> 00:47:53,480
possible and then just suggest those to the user that, okay, this may be the answer. And then they

335
00:47:53,480 --> 00:48:01,480
can either say yes or no. So you could then parse the COA. You can get this list back. And they may

336
00:48:01,480 --> 00:48:09,480
say, yes, that's the product name. No, that's not the product type. Yes, that's the producer. Yes, yes,

337
00:48:09,480 --> 00:48:23,480
yes, yes. So that way you can kind of have a human confirm this at the end of the day. Well, do you

338
00:48:23,480 --> 00:48:33,480
want to see it put to use at scale? Or what are your thoughts on this one?

339
00:48:33,480 --> 00:48:39,480
Show us if you got it. I'm excited to see what you've got up your sleeve.

340
00:48:39,480 --> 00:48:48,480
Okay. Well, this one's here. I'll just get this one turning away and then let's go look at that other one.

341
00:48:48,480 --> 00:48:58,480
So basically I just wrote this into a function. So this is just the exact work we did here, but just in a

342
00:48:58,480 --> 00:49:07,480
simple function. So this is just the prompt. And so these are the things I think, right, this was just

343
00:49:07,480 --> 00:49:13,480
almost like a throwaway prompt. This is just when I kind of just came up with this morning. But these

344
00:49:13,480 --> 00:49:23,480
may be the things that people, these may be people's prized possessions. Just doing a little

345
00:49:23,480 --> 00:49:29,480
demonstration with some of the image generation, it's all about the prompts. And so people are just trying to

346
00:49:29,480 --> 00:49:35,480
figure out the right, right, that's X, right. You've got to think about this as a statistical function or

347
00:49:35,480 --> 00:49:46,480
output is Y and this is X. So this matters a lot to people. And as I said, that's what statisticians have

348
00:49:46,480 --> 00:49:58,480
fun about at lunch is talking about X, talking about what goes into the model and its parameters.

349
00:49:58,480 --> 00:50:07,480
So you can spend a lot of time on this. Okay. But anywho, just the prompt and then all this is doing is

350
00:50:07,480 --> 00:50:18,480
opening the PDF, getting the front page, formatting the prompt from the front page, then pinging open AI.

351
00:50:18,480 --> 00:50:25,480
It's either going to fail or maybe the parsing fails. But you know, we're going to try to get the data at

352
00:50:25,480 --> 00:50:32,480
the end of the day, because basically, this is what the content looks like. And then we can format that

353
00:50:32,480 --> 00:50:42,480
into a nice tidy base on. So I'll just let this one rip here with I'm just going to do five COAs because I

354
00:50:42,480 --> 00:50:53,480
had it run in full in another interpreter. And I'll basically show you the full results. Okay.

355
00:50:53,480 --> 00:51:02,480
So unfortunately, this first one failed. And this is where a lot of trial and error is needed.

356
00:51:02,480 --> 00:51:10,480
And that's why I encourage you all to to play with these prompts, because I've been kind of running this one

357
00:51:10,480 --> 00:51:22,480
on and off. And it's weird because this first observation sometimes fails and sometimes succeeds.

358
00:51:22,480 --> 00:51:32,480
Which is which is just completely strange and bizarre to me that, you know, you'd get, you know, two different

359
00:51:32,480 --> 00:51:44,480
outputs from the same query, but it happens. So I would, you know, encourage you all to try to try to figure out

360
00:51:44,480 --> 00:51:55,480
why that's happening. So that way we can fix it. But this is where I get into the what I the point I raised at the

361
00:51:55,480 --> 00:52:03,480
beginning was, this is now imperfect. You know, we're definitely in the realm of statistics now, right?

362
00:52:03,480 --> 00:52:15,480
We've got a non random sample. These are just certificates that people were posting non randomly in Florida.

363
00:52:15,480 --> 00:52:25,480
So that's our sample we're working with. It's further biased because certain queries are failing.

364
00:52:25,480 --> 00:52:34,480
So that's an unknown source of bias. So like, why are those are those queries failing?

365
00:52:34,480 --> 00:52:40,480
And it may not actually be necessarily by well, actually, I forget if this is technically bought.

366
00:52:40,480 --> 00:52:44,480
Actually, so there's different things that can happen in your statistics, right?

367
00:52:44,480 --> 00:52:56,480
So if you have measurement error or missing observations, please research if that can lead to bias or not.

368
00:52:56,480 --> 00:53:05,480
I don't know off the top of my head. I want to say that miscalculations can I forget if missing data can or not.

369
00:53:05,480 --> 00:53:12,480
I want to say it may lead to bias. So that's some statistical homework for you.

370
00:53:12,480 --> 00:53:16,480
But these are different concepts in statistics.

371
00:53:16,480 --> 00:53:24,480
You know, bias is a statistical concept, you know, missing data and miscalculations or entry error.

372
00:53:24,480 --> 00:53:29,480
Those are just characteristics of the data.

373
00:53:29,480 --> 00:53:40,480
But here you see we got four out of five. So running around 80 percent, which is awesome.

374
00:53:40,480 --> 00:53:46,480
Right. As I said at the beginning, I'd rather look at any data than no data.

375
00:53:46,480 --> 00:53:58,480
So, you know, let's look at this. So that's the first step, right, is get the data.

376
00:53:58,480 --> 00:54:02,480
Let's make sure I've got it here.

377
00:54:02,480 --> 00:54:09,480
So there it is. And now let's visualize it.

378
00:54:09,480 --> 00:54:13,480
And we've got some Khalifa Kush.

379
00:54:13,480 --> 00:54:23,480
And it's so interesting because I just it's so strange how the world works with these little coincidences.

380
00:54:23,480 --> 00:54:39,480
But I just shared a video with you all on Slack the last week about burner of cookies discussing some of the strains that he's tied to.

381
00:54:39,480 --> 00:55:00,480
And he was saying that, you know, Wiz Khalifa wanted a strain and I think they just maybe, you know, branded maybe a really potent variety of the OG Kush as, you know, Wiz Khalifa's Khalifa Kush.

382
00:55:00,480 --> 00:55:11,480
And, you know, just from what I was watching on the video, they were like saying that, like, you know, he wants to be the only one to grow this variety.

383
00:55:11,480 --> 00:55:25,480
And that's awesome. Well, as cannabis data scientists, you know, we can kind of help at least characterize some of the rates.

384
00:55:25,480 --> 00:55:32,480
So burners, the one who grew the strain, but we can, you know, help them with statistics.

385
00:55:32,480 --> 00:55:40,480
So, you know, maybe or maybe not. They may not know it. Say the average rate.

386
00:55:40,480 --> 00:55:46,480
Is this a Sativa or is this an Indica leaning strain into how much?

387
00:55:46,480 --> 00:56:00,480
And so what we can do is we can look at some of our favorite ratios here. So here, I forget how this function works.

388
00:56:00,480 --> 00:56:05,480
Just simple division.

389
00:56:05,480 --> 00:56:09,480
Okay, let's take the mean.

390
00:56:09,480 --> 00:56:25,480
So, remember, my rule of thumb is anything greater than 0.25 or anything greater than a one to four ratio of beta pinene to D-limonene.

391
00:56:25,480 --> 00:56:31,480
I would think of as a Sativa. We were looking at, say, Durban poison.

392
00:56:31,480 --> 00:56:39,480
And that's closer to a three to four or almost a one to one ratio. And that's when I think of a strong Sativa.

393
00:56:39,480 --> 00:56:45,480
So the higher this number goes, typically, I think the more Sativa leaning it is.

394
00:56:45,480 --> 00:56:50,480
And then the lower it goes, the more Indica it's leaning.

395
00:56:50,480 --> 00:56:59,480
And from looking at a sample of this data, it was looking to me like maybe it's some strong, strong Indicas.

396
00:56:59,480 --> 00:57:04,480
Maybe down towards like 0.1 or 0.15.

397
00:57:04,480 --> 00:57:16,480
So to me, this one looks like it's on the Indica side, maybe not the strongest, like the heaviest Indica out there.

398
00:57:16,480 --> 00:57:22,480
But that's how I would characterize this one.

399
00:57:22,480 --> 00:57:30,480
And then here, here I've just parsed a bunch more. So there's a whole lot more parsing to be done.

400
00:57:30,480 --> 00:57:38,480
So this is just the same script, but I just let it run on 57 strains.

401
00:57:38,480 --> 00:57:42,480
Then we can plot all of these on the map.

402
00:57:42,480 --> 00:57:49,480
So here you see these are ones that I would consider heavy Indicas.

403
00:57:49,480 --> 00:58:00,480
So this one, look at this, they're even calling this one a MPX Sativa OSFG Live ROS.

404
00:58:00,480 --> 00:58:08,480
Right. But that one is just kind of spitballing here.

405
00:58:08,480 --> 00:58:14,480
It's around 0.16.

406
00:58:14,480 --> 00:58:21,480
So it's interesting here because as we were saying, names may not always be perfect.

407
00:58:21,480 --> 00:58:38,480
But to me, this looks like it's labeled the Sativa OSFG.

408
00:58:38,480 --> 00:58:51,480
Double check this name for me, please. But this one's a Live ROS. And as I was saying, in my book, I would categorize that one as a strong Indica.

409
00:58:51,480 --> 00:58:56,480
And I would kind of, you know, different people have their different tastes.

410
00:58:56,480 --> 00:59:00,480
Some people seek out the Indicas.

411
00:59:00,480 --> 00:59:07,480
And then, you know, personally, I'm just a bit more fan of the Sativa, but everybody's got their personal preferences.

412
00:59:07,480 --> 00:59:17,480
So if I was shopping, you know, I would be looking for, you know, one of these like these yellow or green strains.

413
00:59:17,480 --> 00:59:24,480
So, you know, I'd be looking for like, you know, one of these Califa cushions.

414
00:59:24,480 --> 00:59:35,480
And so this is where I was getting into the fact, or remember we were talking about at the beginning.

415
00:59:35,480 --> 00:59:40,480
One of the things we would like to start looking at would be phenotypes.

416
00:59:40,480 --> 00:59:47,480
So what you can do here is you can just look at, hopefully.

417
00:59:47,480 --> 00:59:54,480
Yes, so you can. There's a ton of them. So let's just do like a sample of.

418
00:59:54,480 --> 01:00:02,480
Let's just look at 10 of them.

419
01:00:02,480 --> 01:00:05,480
We'll just look at, actually that may have been too many.

420
01:00:05,480 --> 01:00:14,480
So it looks like they're kind of clustering down there. Maybe it was enough.

421
01:00:14,480 --> 01:00:27,480
It looks like the Califa cushions are clustering down, you know, down here towards, as I was saying, you know, more on the Indica side.

422
01:00:27,480 --> 01:00:37,480
But it's just kind of interesting because, you know, now and again, you'll just have essentially a phenotype or phenotype.

423
01:00:37,480 --> 01:00:46,480
So this is the Califa cush strain. But just for whatever reason, just something about that plant.

424
01:00:46,480 --> 01:00:51,480
Maybe it was an environmental factor. Maybe it was a genetic factor.

425
01:00:51,480 --> 01:00:59,480
If they're growing these from seed instead of clone, if they're all clones, then it must be some sort of environmental factor.

426
01:00:59,480 --> 01:01:04,480
But something happened with these two varieties here.

427
01:01:04,480 --> 01:01:10,480
And in my book, you know, they're just more on the Sativa side.

428
01:01:10,480 --> 01:01:20,480
So maybe if, you know, say, you know, Wiz, Califa and Berner were trying to say branch this into two different product lines.

429
01:01:20,480 --> 01:01:31,480
You know, they may be able to, you know, spin off, you know, a Califa Cush Sativa variety strain.

430
01:01:31,480 --> 01:01:34,480
And they, you know, maybe they mix it with Durbin, right?

431
01:01:34,480 --> 01:01:43,480
Maybe you mix this phenotype with the Durbin, that's, you know, Sativa with some of the Califa Cush attributes that you like.

432
01:01:43,480 --> 01:01:49,480
And then maybe they just keep this phenotype around as their classic line.

433
01:01:49,480 --> 01:01:58,480
Or maybe they would say, oh, this one doesn't match the typical Califa Cush that people are expecting in the market.

434
01:01:58,480 --> 01:02:02,480
So maybe that one doesn't meet their quality control standards.

435
01:02:02,480 --> 01:02:06,480
So, you know, there's a lot you can do with this from here.

436
01:02:06,480 --> 01:02:12,480
But as I was saying, this is just a prediction, right?

437
01:02:12,480 --> 01:02:18,480
We need to actually, I've got all the code up here.

438
01:02:18,480 --> 01:02:22,480
So we've actually see this so strange, right?

439
01:02:22,480 --> 01:02:31,480
Because look, this time, right, when I just ran this code for you over here, the first AI query failed.

440
01:02:31,480 --> 01:02:42,480
But for whatever reason, when I ran it the first time, it succeeded and did give a result.

441
01:02:42,480 --> 01:02:53,480
So as I was saying, you know, more work needs to be done to figure out, you know, why are these requests succeeding and why are they failing?

442
01:02:53,480 --> 01:02:59,480
And, you know, are these results accurate?

443
01:02:59,480 --> 01:03:09,480
So, for example, you know, this one didn't report any beta-pionine or delimiting.

444
01:03:09,480 --> 01:03:17,480
And so what we can do is we can actually pull up that.

445
01:03:17,480 --> 01:03:25,480
You can actually pull up this PDF.

446
01:03:25,480 --> 01:03:34,480
And, you know, we can confirm if this thing has terpenes on it or not.

447
01:03:34,480 --> 01:03:44,480
So here we have the NC, sure enough, it's got beta-pionine and it also has delimiting.

448
01:03:44,480 --> 01:03:49,480
So this would be an inaccurate prediction.

449
01:03:49,480 --> 01:03:57,480
So actually, 0.69.

450
01:03:57,480 --> 01:04:00,480
Oh, wait, actually, hold on.

451
01:04:00,480 --> 01:04:04,480
This one actually, maybe that wasn't the null one.

452
01:04:04,480 --> 01:04:08,480
Maybe these ended up getting sorted somehow.

453
01:04:08,480 --> 01:04:18,480
Okay, so these may not be in order, but let's see if I can find one with missing values.

454
01:04:18,480 --> 01:04:23,480
Okay, so that may have actually worked out okay.

455
01:04:23,480 --> 01:04:28,480
So once again, I'm going to go ahead and start wrapping up since I've gone way over time.

456
01:04:28,480 --> 01:04:35,480
But I would encourage you all to basically, you know, we're going to have to start double checking these, right?

457
01:04:35,480 --> 01:04:41,480
Because it's like if we're going to use this model to extract cannabinoid and terpene data,

458
01:04:41,480 --> 01:04:46,480
then we want to have some measure of accuracy.

459
01:04:46,480 --> 01:05:01,480
So I still think it would be a useful exercise to say parse these through a well-defined algorithm

460
01:05:01,480 --> 01:05:11,480
that we know for certain is pulling all the terpenes and parsing them, you know, say to 99% accuracy

461
01:05:11,480 --> 01:05:15,480
and then compare that to the OpenAI model.

462
01:05:15,480 --> 01:05:20,480
And we would like this OpenAI model to, you know, approach.

463
01:05:20,480 --> 01:05:23,480
It's probably never going to be 100% accurate,

464
01:05:23,480 --> 01:05:30,480
but we would like it to at least approach 100% accuracy even if we never get there.

465
01:05:30,480 --> 01:05:44,480
So that's the first step, I think, or a next step is figure out what exactly is the accuracy of this prompt slash algorithm.

466
01:05:44,480 --> 01:05:50,480
And then, as I was saying, this takes a little bit of time here.

467
01:05:50,480 --> 01:05:58,480
So, you know, this may have taken five to 10 to 15 minutes or so to parse 50.

468
01:05:58,480 --> 01:06:05,480
And also, remember, there's a non-negligible price here.

469
01:06:05,480 --> 01:06:10,480
So actually, I've kind of raked up a little bit of a bill this morning.

470
01:06:10,480 --> 01:06:22,480
So through all of my development and parsing, so remember, you know, I parsed, you know, maybe 100 or so of these.

471
01:06:22,480 --> 01:06:26,480
It has, you know, cost me about five bucks.

472
01:06:26,480 --> 01:06:37,480
So once again, that's not the most in the world, but, you know, it's non-negligible.

473
01:06:37,480 --> 01:06:44,480
But my philosophy is these are such small costs and the data is so valuable.

474
01:06:44,480 --> 01:06:51,480
Surely, the marginal benefit exceeds these tiny marginal costs.

475
01:06:51,480 --> 01:07:00,480
So we've got north of 15,000 COAs that we could potentially parse.

476
01:07:00,480 --> 01:07:06,480
Don't have to do them all with AI, but check this out.

477
01:07:06,480 --> 01:07:16,480
I'll just tie this off at the end of the day with remember at the beginning, why are we doing all of this?

478
01:07:16,480 --> 01:07:28,480
Because somebody says, you know, can't we just get a universal COA site where we can say search and look at results?

479
01:07:28,480 --> 01:07:34,480
Well, ask and you shall receive.

480
01:07:34,480 --> 01:07:40,480
So I encourage you all to put this in your browser and smoke it.

481
01:07:40,480 --> 01:07:50,480
And this is just a work in progress, but a this is an extreme work in progress.

482
01:07:50,480 --> 01:07:54,480
So this is just a pure development version.

483
01:07:54,480 --> 01:07:59,480
And so there's still a lot more work to be done, especially on the parsing front.

484
01:07:59,480 --> 01:08:04,480
So I haven't incorporated the parsing yet, but this is just a retrieval tool.

485
01:08:04,480 --> 01:08:14,480
But, you know, now if we start populating a database, this may have been too much to ask for.

486
01:08:14,480 --> 01:08:17,480
OK, may have crashed. OK.

487
01:08:17,480 --> 01:08:25,480
So as I said, there's more work to be done to make sure that the website doesn't crash.

488
01:08:25,480 --> 01:08:39,480
And you can do cool things with this list, you know, like all the fun things people like to do, like sort, filter, search, download.

489
01:08:39,480 --> 01:08:48,480
So right now it just has the this one doesn't I'm not sure if this is a valid PDF.

490
01:08:48,480 --> 01:08:50,480
Oh, yeah. Check that out.

491
01:08:50,480 --> 01:08:56,480
So here's just a rosin from from green scientific.

492
01:08:56,480 --> 01:09:04,480
So, you know, this and the search algorithm I cooked up last night is is far from perfect.

493
01:09:04,480 --> 01:09:12,480
So there's a thousand and one ways that this can be improved upon or just taking an entirely different direction.

494
01:09:12,480 --> 01:09:19,480
So, Rick, you know, if you want to to work on, you know, an entirely different aspect, you're welcome to.

495
01:09:19,480 --> 01:09:26,480
There's just this is just a demonstration that, hey, this data is there.

496
01:09:26,480 --> 01:09:32,480
These are just links to PDFs that exist on the Web.

497
01:09:32,480 --> 01:09:37,480
We can make these readily searchable.

498
01:09:37,480 --> 01:09:40,480
And so now we can at least get these ways.

499
01:09:40,480 --> 01:09:47,480
So now somebody can at least search and find a way.

500
01:09:47,480 --> 01:09:57,480
Well, you know, now we have written just this morning an algorithm to actually parse this.

501
01:09:57,480 --> 01:10:03,480
So once again, it's going to be a non zero cost.

502
01:10:03,480 --> 01:10:14,480
But somebody could now try to parse this way, get the cannabinoids or the terpenes out of it.

503
01:10:14,480 --> 01:10:18,480
And as I was saying, we may have to adjust the parsing algorithm.

504
01:10:18,480 --> 01:10:22,480
Right. This one is just looking at the front page.

505
01:10:22,480 --> 01:10:28,480
But as we see here, the front page contains some of the information,

506
01:10:28,480 --> 01:10:37,480
but we'll have to go to some of the subsequent pages to actually get the terpene data.

507
01:10:37,480 --> 01:10:48,480
So once again, this parsing algorithm is going to need to be tailored so that maybe you say iterate over all of the pages

508
01:10:48,480 --> 01:10:53,480
until you've gotten all the data points that you're looking for.

509
01:10:53,480 --> 01:10:59,480
And once again, we're only getting, say, two or three terpenes here.

510
01:10:59,480 --> 01:11:04,480
Ideally, we could get all of the terpenes.

511
01:11:04,480 --> 01:11:11,480
So we're going to have to do this in a way that's clever prompting to do.

512
01:11:11,480 --> 01:11:17,480
So that's sort of the work that I've been tinkering on.

513
01:11:17,480 --> 01:11:32,480
You know, I encourage you all to play out, play around with this tool because I populated this with, as I said, north of 15,000 COAs.

514
01:11:32,480 --> 01:11:36,480
So that's why I'm kind of opening it up to you to kind of do development.

515
01:11:36,480 --> 01:11:48,480
I'm trying to sort of get an estimation about, you know, what is the marginal cost of making all this data searchable, parsable.

516
01:11:48,480 --> 01:11:58,480
And once it's parsed, then you'll have, you know, your nice data file here that awesome people like Rick and whoever.

517
01:11:58,480 --> 01:12:16,480
So you can now finally use Panabinoid Terpene data, all the data on the certificate, to build rich models and find amazing insights.

518
01:12:16,480 --> 01:12:21,480
So I encourage you all to play around with this.

519
01:12:21,480 --> 01:12:34,480
Submit issues, grab your keyboard and contribute, clone the code and use it in your own projects.

520
01:12:34,480 --> 01:12:36,480
Use it however you wish.

521
01:12:36,480 --> 01:12:42,480
And then if you scroll down here, you're also free to contribute.

522
01:12:42,480 --> 01:12:54,480
Because as I said, you know, I'm going to kind of be watching the usage here and I'm going to have to maybe think about, you know,

523
01:12:54,480 --> 01:13:05,480
if I make this tool widely available, it would be, I may unfortunately either have to charge for the service

524
01:13:05,480 --> 01:13:14,480
or if any of you want to contribute, then you're welcome to as well or run it on your own machine.

525
01:13:14,480 --> 01:13:17,480
And, you know, as I said, there's going to be a cost there.

526
01:13:17,480 --> 01:13:27,480
But as I said, there's a small marginal cost, but we've got the sources, we've got the methods, we've got the data.

527
01:13:27,480 --> 01:13:30,480
Let's dive into the analysis.

528
01:13:30,480 --> 01:13:46,480
So please feel free to contribute or explore or, you know, check out the GitHub and tinker to your heart's content.

529
01:13:46,480 --> 01:13:59,480
Or at the very least, give the repository a star if this is something that you find useful.

530
01:13:59,480 --> 01:14:02,480
What is everybody's thoughts, comments, questions?

531
01:14:02,480 --> 01:14:09,480
I know it was a lot of material, but it seems like we've finally achieved or not achieved,

532
01:14:09,480 --> 01:14:26,480
but we're finally starting to make progress on some tangible progress that people can put in their hands and use or put in their pipes and use.

533
01:14:26,480 --> 01:14:34,480
Did anybody have any thoughts they want to share?

534
01:14:34,480 --> 01:14:38,480
I needed that review, so thank you very much.

535
01:14:38,480 --> 01:14:42,480
It was good to, you know, for my first time back up for a long time.

536
01:14:42,480 --> 01:14:44,480
So thank you for that.

537
01:14:44,480 --> 01:14:50,480
But I'll go through the GitHub again, as well as the videos like we talked about.

538
01:14:50,480 --> 01:14:51,480
Thank you.

539
01:14:51,480 --> 01:15:00,480
And there's a lot there, and as I would really want to drive home, this is more of a demonstration.

540
01:15:00,480 --> 01:15:12,480
A lot of this needs to be perfected or if not perfected, probably, right, we can never get to 100 percent, but it can always be improved upon.

541
01:15:12,480 --> 01:15:24,480
So that was just a first iteration of what a COA parsing prompt could look like or a lab results search could look like.

542
01:15:24,480 --> 01:15:29,480
So these are all just a first iteration.

543
01:15:29,480 --> 01:15:38,480
And as I was pointing out, there's many, many imperfections and things that can be improved upon.

544
01:15:38,480 --> 01:15:53,480
So I encourage you all to use it, make it your own, tinker, explore, pester me if you want anything added to it or you want any data.

545
01:15:53,480 --> 01:15:57,480
So I think there's a lot there.

546
01:15:57,480 --> 01:16:01,480
We've just begun to explore.

547
01:16:01,480 --> 01:16:03,480
Rick, comment, question?

548
01:16:03,480 --> 01:16:21,480
No, I just wanted to say awesome presentation that I mean, you met or created a proof of concept, a really good one for a problem, a real problem, and use some cutting edge technology to do it.

549
01:16:21,480 --> 01:16:23,480
I think that you're right.

550
01:16:23,480 --> 01:16:30,480
You could improve the cost associated with it through like playing around with different models and stuff like that.

551
01:16:30,480 --> 01:16:44,480
I've been doing some like computer vision training and stuff, which you might that might be an application for this as well in terms of like collecting specific stuff and parsing it in a format that you want consistently.

552
01:16:44,480 --> 01:16:47,480
I'm going to take a look at the GitHub.

553
01:16:47,480 --> 01:16:52,480
I haven't seen it yet, so I'm excited and I'll try to contribute as much as I can.

554
01:16:52,480 --> 01:16:55,480
So thank you.

555
01:16:55,480 --> 01:17:03,480
Oh, phenomenal. And remember, this is actually only a cost we technically have to bear once.

556
01:17:03,480 --> 01:17:07,480
So once the COAs parsed, it's parsed.

557
01:17:07,480 --> 01:17:10,480
And so that's where you were mentioning your vector databases.

558
01:17:10,480 --> 01:17:13,480
And that's exactly what it can be used for.

559
01:17:13,480 --> 01:17:27,480
Parse the COA once that may cost one or two cents or say five to ten cents or what have you through OpenAI, but then save that in your vector database.

560
01:17:27,480 --> 01:17:34,480
So that way, if you see a PDF, remember we talked about hashes, it would be similar with a vector.

561
01:17:34,480 --> 01:17:43,480
So as soon as you see a PDF, you create a vector for it, you see if that vector already exists in your database.

562
01:17:43,480 --> 01:17:49,480
If it does, you can get the lab results without having to reparse it.

563
01:17:49,480 --> 01:17:56,480
If it doesn't yet exist, then you, you know, you parse it and save it in your database.

564
01:17:56,480 --> 01:17:59,480
So that could be a way you could save costs right off the bat.

565
01:17:59,480 --> 01:18:10,480
Well, if there's fifteen hundred of them out there right now, I'll play around with saving them to my vector database and then let you know how it works out.

566
01:18:10,480 --> 01:18:14,480
Is please I encourage you all to play play around with it.

567
01:18:14,480 --> 01:18:19,480
And as I was showing you, I just started with a sample size of five.

568
01:18:19,480 --> 01:18:26,480
And then as you saw, gradually worked up to a sample size of around fifty five.

569
01:18:26,480 --> 01:18:33,480
So as I was I always mentioned, I'm a small sample guy.

570
01:18:33,480 --> 01:18:35,480
You can do a lot with the small sample.

571
01:18:35,480 --> 01:18:41,480
So I encourage you to put five of these observations or put fifty in your vector database.

572
01:18:41,480 --> 01:18:48,480
If it proves useful, works well for your application, fill it up with fifteen hundred.

573
01:18:48,480 --> 01:18:52,480
But as I said, the cost may get a little high.

574
01:18:52,480 --> 01:19:05,480
But if there's value, I think hopefully there will be people out there who'd be willing to chip in a couple bucks to get their COA data.

575
01:19:05,480 --> 01:19:09,480
Phenomenal. Thank you all for coming.

576
01:19:09,480 --> 01:19:14,480
Your eyes, your ears, your attention. This is all moving the ball forward.

577
01:19:14,480 --> 01:19:21,480
This wouldn't happen without you. So you're the ones who are really helping this actually happen.

578
01:19:21,480 --> 01:19:34,480
And move everything forward. It's all the thoughts, the questions, the ideas that have happened throughout the whole cannabis data science meetup that's led us on this journey.

579
01:19:34,480 --> 01:19:42,480
You realize people wanted COAs. You realize there was a way we could parse them, get the text out of them.

580
01:19:42,480 --> 01:19:50,480
And we started exploring open AI tools and people wanted to learn more about those and learn more about terpene profiles and phenotypes.

581
01:19:50,480 --> 01:19:58,480
It's just all of this ties together until we are where we are today, where we realize, oh, there's Florida patients.

582
01:19:58,480 --> 01:20:04,480
They've been asking for their certificates like we encourage people to do.

583
01:20:04,480 --> 01:20:11,480
There's finally a lot of these URLs that are available because companies are making them available.

584
01:20:11,480 --> 01:20:22,480
That's my final point for today is we started asking and now we saw large companies like Cureleaf or putting these in VitaCann.

585
01:20:22,480 --> 01:20:27,480
They're putting them out by the hundreds, if not thousands. And it's becoming popular.

586
01:20:27,480 --> 01:20:35,480
So there's a company in Florida, The Flowery, and they do what's called COA drops.

587
01:20:35,480 --> 01:20:43,480
And so every few weeks or month, they'll just drop, say, 20 to 60 COAs.

588
01:20:43,480 --> 01:20:47,480
And that's their new inventory they're releasing at the store.

589
01:20:47,480 --> 01:20:55,480
And it's brilliant because we were talking about these tests can run up to, say, 500 bucks a pop.

590
01:20:55,480 --> 01:21:03,480
So if you tested 20 samples at 500 bucks a pop, that's like a $10,000 bill.

591
01:21:03,480 --> 01:21:11,480
And instead of just eating that bill, you actually get to use it for your benefit.

592
01:21:11,480 --> 01:21:18,480
So they're now using their COAs as a marketing tool. So they're like they're hyping it up.

593
01:21:18,480 --> 01:21:23,480
They're saying, OK, we're going to drop the COAs. And then they release them.

594
01:21:23,480 --> 01:21:29,480
And then everybody's excited about it. They're looking through them and they're talking about them on Reddit.

595
01:21:29,480 --> 01:21:36,480
They're saying, like, oh, like, check this one out. This one's got this one's got a ton of D-Lemonine or oh, watch out.

596
01:21:36,480 --> 01:21:49,480
This one's this one's a foreign bomb. So you can get quick feedback, get people excited for your products and use the lab result data to your advantage.

597
01:21:49,480 --> 01:21:57,480
Instead of just letting the COAs accumulate dust on somebody's old dusty hard drive.

598
01:21:57,480 --> 01:22:03,480
So I love it seeing companies putting this data to good use.

599
01:22:03,480 --> 01:22:10,480
It's getting consumers excited about it. And because of all of you awesome cannabis data scientists,

600
01:22:10,480 --> 01:22:19,480
we can add our piece to the puzzle and actually make the data readily accessible and analyzable.

601
01:22:19,480 --> 01:22:31,480
So and then it's up to all of you to start drawing nifty little insights. So what is the average for Nizine that you can expect?

602
01:22:31,480 --> 01:22:39,480
You know, what's the ranges on that? You know, what's the range on this beta pinin to D-Lemonine ratio?

603
01:22:39,480 --> 01:22:46,480
Like what's the strongest sativa in the Florida market? And what's the strongest indica?

604
01:22:46,480 --> 01:22:55,480
You know, you can start to discover these things and you know, you may find gold.

605
01:22:55,480 --> 01:23:02,480
Too cool. I'll get all these links and material posted and available for you.

606
01:23:02,480 --> 01:23:13,480
I encourage you all to to get in touch throughout the week, even if you want to use some of this material, you know, in your own projects or however you see fit.

607
01:23:13,480 --> 01:23:30,480
You know, let's just keep working on this, because as I said, as soon as I discovered this, all of a sudden, you know, now there's a, you know, a thousand and one more things to do that we can just keep slowly and methodically tinkering on it.

608
01:23:30,480 --> 01:23:36,480
Well, you need to go out and enjoy your day. I've kept you too long.

609
01:23:36,480 --> 01:23:44,480
Thank you one more time for coming, helping advance candidate science.

