1
00:00:00,000 --> 00:00:13,920
Welcome to the Cannabis Data Science Meetup Group for October 13th.

2
00:00:13,920 --> 00:00:16,160
Hopefully you're all doing fantastic.

3
00:00:17,360 --> 00:00:23,520
As always, here to have a little fun, talk about cannabis data, and see if we can discover some insights.

4
00:00:23,520 --> 00:00:32,400
So, I'm going to go ahead and kick off the presentation.

5
00:00:39,600 --> 00:00:48,880
All right, we're going to do a little, you know, novel analysis today.

6
00:00:48,880 --> 00:00:52,800
So, we're essentially picking up where we left off last week.

7
00:01:00,240 --> 00:01:14,320
And we were doing a market analysis of Massachusetts, and last week we estimated the prices, and we were going to see if we couldn't have a rough estimate of profits.

8
00:01:14,320 --> 00:01:22,640
So, to just show you some macroeconomic material here.

9
00:01:24,960 --> 00:01:31,440
Here are some notes that I've written on macroeconomics that you can find on my website.

10
00:01:33,440 --> 00:01:34,800
Nothing glamorous here.

11
00:01:34,800 --> 00:01:44,640
Just a recap of some economics, and welcome to the meetup, Heather.

12
00:01:44,640 --> 00:01:45,920
Welcome, Heather.

13
00:01:47,920 --> 00:01:52,720
We're just going over a recap of some of the economics we've touched on.

14
00:01:52,720 --> 00:02:02,720
Then we're going to dive into the Massachusetts data, and then we're just going to keep extending if we finish with Massachusetts.

15
00:02:02,720 --> 00:02:09,520
I've already lined up a couple of data sets coming from California that we can start to wrangle.

16
00:02:09,520 --> 00:02:22,400
So, we're just going to keep calculating these economic statistics, state by state, until we can eventually calculate the aggregate economic statistics.

17
00:02:22,400 --> 00:02:39,680
In this case, we're doing a market analysis of Massachusetts, trying to discover the prices, and then we can begin to perhaps estimate the profits of the companies.

18
00:02:39,680 --> 00:02:54,960
As I was hitting this pitfall, I realized that the economic models, when we're estimating the competitive prices, assume that profits are zero.

19
00:02:54,960 --> 00:03:10,160
So, this may not necessarily pan out, but we can give it a shot.

20
00:03:10,160 --> 00:03:25,360
Long story short, we're just going to copy some of the code from the prior weeks.

21
00:03:25,360 --> 00:03:42,160
And then we're going to get the data.

22
00:03:42,160 --> 00:04:01,360
Welcome, Jessica. We're just doing a bit of crude analysis of some data in Massachusetts.

23
00:04:01,360 --> 00:04:06,480
So, welcome to the group. Thank you.

24
00:04:06,480 --> 00:04:22,560
Well, I guess just to give you a bit about us. So, I'm starting the company, Cannlytics, and our main focus is data analytics.

25
00:04:22,560 --> 00:04:37,360
And so, we've started this group, and essentially, each week we wrangle some data and calculate some statistics.

26
00:04:37,360 --> 00:04:45,520
So, you're welcome to share a bit about yourself to the group if you'd like, but that's entirely up to you.

27
00:04:45,520 --> 00:04:54,720
Sure, no problem. I actually just came across your guys' meetup this morning. I recently started a data analytics boot camp with Springboard.

28
00:04:54,720 --> 00:05:05,760
So, my career background is 20 years of marketing and sales, but within those positions, I've always done some types of KPI reporting and things like that.

29
00:05:05,760 --> 00:05:17,760
So, I'm now getting certified and learning Python, Tableau, Power BI, things like that. But again, early stages of those programs, pretty proficient in Excel analytics.

30
00:05:17,760 --> 00:05:25,360
So, I thought this would just be an interesting meetup just to sit in, kind of observe, maybe make a few connections and go from there.

31
00:05:25,360 --> 00:05:25,760
Awesome.

32
00:05:25,760 --> 00:05:27,200
Programmed through?

33
00:05:27,200 --> 00:05:28,240
I'm sorry?

34
00:05:28,240 --> 00:05:30,080
Where's your program through?

35
00:05:30,080 --> 00:05:34,320
It's an online program called Springboard. It's powered by Microsoft.

36
00:05:34,320 --> 00:05:35,200
Thank you.

37
00:05:35,200 --> 00:05:35,840
Yep.

38
00:05:35,840 --> 00:05:39,360
Microsoft, okay.

39
00:05:39,360 --> 00:05:43,360
Well, excellent to have you. You're in the right place.

40
00:05:43,360 --> 00:05:45,280
Fantastic.

41
00:05:45,280 --> 00:05:54,080
Essentially, we're just beginning. So, essentially, I've just written some Python packages here that we'll be using.

42
00:05:54,080 --> 00:06:08,640
And essentially, we're going to be using the Socrata API to get data from Massachusetts.

43
00:06:08,640 --> 00:06:15,920
So, for example, we'll be working with this data set here.

44
00:06:15,920 --> 00:06:35,520
So, there's documentation about how you can get started and how to get an API token, which is optional, and then the fields that we'll be working with.

45
00:06:35,520 --> 00:06:50,240
So, long story short, we'll be making requests to the API, to this URL.

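A minimal sketch of that request, using only the standard library. The base URL here is the Massachusetts Cannabis Control Commission's open data portal, and the dataset ID passed in is a placeholder; check the portal's documentation for the real identifiers.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed base URL of the Massachusetts open data portal.
BASE = "https://opendata.mass-cannabis-control.com"

def socrata_url(dataset_id, limit=5000):
    """Build the JSON endpoint URL for a Socrata dataset."""
    return f"{BASE}/resource/{dataset_id}.json?{urlencode({'$limit': limit})}"

def get_socrata_data(dataset_id, app_token=None, limit=5000):
    """Request records from a Socrata dataset; an app token is optional
    but helps avoid throttling."""
    headers = {"X-App-Token": app_token} if app_token else {}
    request = Request(socrata_url(dataset_id, limit), headers=headers)
    with urlopen(request) as response:
        return json.load(response)
```

From there, the list of records can be loaded straight into a DataFrame for analysis.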
46
00:06:50,240 --> 00:07:05,200
And we'll essentially just start by getting this production data here.

47
00:07:05,200 --> 00:07:20,480
And these are the variables we're working with. I simply defined them here just so that they're easy to see.

48
00:07:20,480 --> 00:07:25,680
You know, we can see the last five observations.

49
00:07:25,680 --> 00:07:37,760
If you want to look at the last observation, you can turn that into a dictionary.

50
00:07:37,760 --> 00:07:55,600
And so, we see, okay, we've got data through October 12th, at which point there were 127,000 mature plants, so on and so forth.

51
00:07:55,600 --> 00:08:04,240
And we can even see sales. So, this, I believe, is cumulative sales.

52
00:08:04,240 --> 00:08:14,560
So, we can essentially calculate daily sales by simply taking the difference from day to day.

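The day-to-day differencing is a one-liner in pandas; the numbers below are made up for illustration.

```python
import pandas as pd

# Hypothetical cumulative sales series indexed by date.
cumulative = pd.Series(
    [100.0, 250.0, 450.0, 700.0],
    index=pd.to_datetime(["2021-10-09", "2021-10-10", "2021-10-11", "2021-10-12"]),
    name="total_sales",
)

# Daily sales are the day-over-day difference of the running total.
# The first day has no prior observation, so it comes out missing.
daily_sales = cumulative.diff()
```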
53
00:08:14,560 --> 00:08:22,400
So, that's the first data point that we're adding.

54
00:08:22,400 --> 00:08:42,480
So, now we can actually look at the last observation and see, okay, on October 12th, there were, you know, about 5.5 thousand in sales in Massachusetts.

55
00:08:42,480 --> 00:08:59,040
There are a couple outliers here that we just need to clean up. So, sales that are much higher than believable, as well as negative sales.

56
00:08:59,040 --> 00:09:02,320
So, those may have been adjustments that happened.

57
00:09:02,320 --> 00:09:20,800
Next, we're just going to aggregate the data into, you know, monthly and quarterly series.

58
00:09:20,800 --> 00:09:34,160
When you aggregate, there's two ways you can aggregate, right? You can either take the sum, so that would be what we're doing when we're looking at sales, right?

59
00:09:34,160 --> 00:09:43,760
Because when you aggregate sales, you're looking at the total sales that occur during the week or month or quarter.

60
00:09:43,760 --> 00:09:57,200
And when you aggregate something like total employees, well, it wouldn't make sense just to sum the total employees there.

61
00:09:57,200 --> 00:10:05,200
You're more looking at an average, you know, what's the average number of employees at a given point during the month.

62
00:10:05,200 --> 00:10:17,520
The same is true for plants and packages, right? You're more looking for the average there.

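The sum-versus-mean distinction maps directly onto pandas resampling: flows like sales get summed, stocks like employees, plants, and packages get averaged. The toy data here is illustrative.

```python
import pandas as pd

# Sixty days of constant hypothetical data, to make the difference obvious.
idx = pd.date_range("2021-01-01", periods=60, freq="D")
data = pd.DataFrame({"sales": 100.0, "total_employees": 50.0}, index=idx)

# Flows aggregate by summing over the period; stocks by averaging.
monthly = pd.DataFrame({
    "sales": data["sales"].resample("MS").sum(),
    "total_employees": data["total_employees"].resample("MS").mean(),
})
```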
63
00:10:17,520 --> 00:10:25,520
So, we'll create these series.

64
00:10:25,520 --> 00:10:43,920
Then another Socrata data set here is simply the licenses.

65
00:10:43,920 --> 00:10:53,920
So, these are, you know, the business name, the license type.

66
00:10:53,920 --> 00:11:05,520
And there are actually many fields for the licenses. So, there's a lot of, you know, rich analysis that you can do here.

67
00:11:05,520 --> 00:11:12,480
So, for example, let's go ahead and read that data in so that we can take a look at it.

68
00:11:12,480 --> 00:11:24,480
So, essentially just reading in the licensees here.

69
00:11:24,480 --> 00:11:30,400
Here are the last five.

70
00:11:30,400 --> 00:11:36,800
And here is the last observation.

71
00:11:36,800 --> 00:11:48,400
As you can see, we have quite a number of data points.

72
00:11:48,400 --> 00:11:55,600
We like to quantify things here. So,

73
00:11:55,600 --> 00:12:03,040
we can say, oh, you know, what's the length of all of these?

74
00:12:03,040 --> 00:12:10,320
There's 51 data points for each licensee.

75
00:12:10,320 --> 00:12:15,840
So, that's awesome. You know, the more the merrier.

76
00:12:15,840 --> 00:12:23,840
They've got some real cool things here. So, for example, they already have the address geocoded.

77
00:12:23,840 --> 00:12:29,920
So, you could already do some interesting plotting on a map.

78
00:12:29,920 --> 00:12:35,120
And we've been meaning to revisit mapping.

79
00:12:35,120 --> 00:12:50,400
So, I'm not sure if we'll get around to it this week, but perhaps if not this week, then maybe in the near future, we can start plotting some of these on a map.

80
00:12:50,400 --> 00:12:59,840
And then the main data points here that we'll be using are, okay, you know, the license type.

81
00:12:59,840 --> 00:13:08,160
And simply just the count, you know, how many licensees are there?

82
00:13:08,160 --> 00:13:12,560
So,

83
00:13:12,560 --> 00:13:18,400
2750.

84
00:13:18,400 --> 00:13:28,000
We're doing a market analysis here. So, ultimately, we're trying to price everything.

85
00:13:28,000 --> 00:13:32,880
And so, we priced everything last week, just sort of recapping real quick.

86
00:13:32,880 --> 00:13:44,160
And so, we're given the price of flower,

87
00:13:44,160 --> 00:13:49,920
which is just the average price per ounce.

88
00:13:49,920 --> 00:13:55,360
And then we just created a couple supplementary series here

89
00:13:55,360 --> 00:14:02,960
just to put this into, okay, so what's, you know, the price per ounce?

90
00:14:02,960 --> 00:14:10,480
I'm not sure.

91
00:14:10,480 --> 00:14:13,120
I guess.

92
00:14:13,120 --> 00:14:25,280
And this is something that we noted in prior weeks was, okay, where there's actually missing data here in April of 2020.

93
00:14:25,280 --> 00:14:35,760
So, that's actually why there's the big dip. So, it's not necessarily that there was a real drop in price.

94
00:14:35,760 --> 00:14:40,800
And this is the unfortunate thing about missing the data there, right?

95
00:14:40,800 --> 00:14:47,680
Because clearly there was, you know, a shock during that time.

96
00:14:47,680 --> 00:14:52,240
So, their data definitely probably fluctuated.

97
00:14:52,240 --> 00:14:57,680
But unfortunately,

98
00:14:57,680 --> 00:15:05,920
the data reporting at that time is not great. So,

99
00:15:05,920 --> 00:15:09,120
yes, we have a conundrum.

100
00:15:09,120 --> 00:15:12,080
We'll just move forward.

101
00:15:12,080 --> 00:15:19,920
Now, just stating that.

102
00:15:19,920 --> 00:15:26,080
Okay.

103
00:15:26,080 --> 00:15:28,560
I think we still need some of these data points.

104
00:15:28,560 --> 00:15:41,440
So, essentially, we've supplemented these data points with just a handful of other data points that are collected by the Federal Reserve here.

105
00:15:41,440 --> 00:15:47,120
So, for example, we're grabbing the population in Massachusetts.

106
00:15:47,120 --> 00:15:56,080
And we're grabbing, you know, the GDP in Massachusetts.

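Those supplementary series can be pulled from FRED, the Federal Reserve's data service, which serves any series as CSV. The series IDs below for Massachusetts population and GDP are assumptions; verify the exact IDs on FRED before relying on them.

```python
def fred_csv_url(series_id):
    """FRED serves any series as CSV through its fredgraph endpoint."""
    return f"https://fred.stlouisfed.org/graph/fredgraph.csv?id={series_id}"

# Assumed FRED series IDs: "MAPOP" (Massachusetts resident population)
# and "MANGSP" (Massachusetts gross state product).
massachusetts_series = {
    "population": fred_csv_url("MAPOP"),
    "gdp": fred_csv_url("MANGSP"),
}
```

Each URL can then be read directly with `pandas.read_csv(url, index_col=0, parse_dates=True)`.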
107
00:15:56,080 --> 00:16:02,400
And you see here, you know, quarter two of 2020, there is a significant dip here.

108
00:16:02,400 --> 00:16:10,160
So, perhaps this dip in price is representative of what actually happened.

109
00:16:10,160 --> 00:16:21,360
However, we'll show you, you know, if you look at sales,

110
00:16:21,360 --> 00:16:36,880
if you look at sales, you'll see that, oh, we're missing sales really between the end of March of 2020 and the end of May of 2020.

111
00:16:36,880 --> 00:16:42,080
So, that's real unfortunate because

112
00:16:42,080 --> 00:16:50,560
it would have been interesting to know what happened with sales during that period.

113
00:16:50,560 --> 00:16:58,640
So, that's actually an opportune time for data curation, right?

114
00:16:58,640 --> 00:17:08,160
So, we've been talking about creating value with data and by curating data.

115
00:17:08,160 --> 00:17:14,960
And so, people are in possession of that data, right?

116
00:17:14,960 --> 00:17:23,120
So, if you're a retailer, you may be in possession of your own sales data during that time.

117
00:17:23,120 --> 00:17:32,800
And so, you know, wouldn't it be so cool if, you know, all of the retailers could essentially form like a data co-op

118
00:17:32,800 --> 00:17:44,240
and, you know, basically report, OK, this is what total sales were during this period in Massachusetts.

119
00:17:44,240 --> 00:17:49,600
Because that data is valuable and missing.

120
00:17:49,600 --> 00:17:55,120
And it exists somewhere.

121
00:17:55,120 --> 00:17:58,320
Anyways, more of that to come.

122
00:17:58,320 --> 00:18:13,360
So, as you can see, a theme that we've been driving here at Cannlytics is, you know, there's actually a lot of value gained by, you know, strategically,

123
00:18:13,360 --> 00:18:24,320
you know, curating your data and, you know, getting it into a usable format and supplementing it with other data points.

124
00:18:24,320 --> 00:18:31,280
And there's a lot you can do.

125
00:18:31,280 --> 00:18:39,040
Long story short, I'm going to go ahead and read in this data in case we haven't already.

126
00:18:39,040 --> 00:19:00,800
And, you know, that's simply the same data that we're looking at here, except just for the tail.

127
00:19:00,800 --> 00:19:07,840
We're just looking at this segment.

128
00:19:07,840 --> 00:19:17,840
So, anywho, we've read in this data.

129
00:19:17,840 --> 00:19:24,880
OK, now this is where we were going to get into the economics.

130
00:19:24,880 --> 00:19:45,440
And so, I just wanted to give a bit of a recap here and actually estimate a slightly simpler model this week and see if we can't get a better measure of the rate of return here.

131
00:19:45,440 --> 00:20:02,080
So, so one way we can estimate the rate of return is, OK, so what if.

132
00:20:02,080 --> 00:20:06,720
So last week we said, OK, there was two parameters, alpha and beta.

133
00:20:06,720 --> 00:20:12,080
In this week, we're just going to say, OK, it's alpha and one minus alpha.

134
00:20:12,080 --> 00:20:17,440
And so that restricts the model to having constant returns to scale.

135
00:20:17,440 --> 00:20:22,080
So that's an assumption that's built into the model.

136
00:20:22,080 --> 00:20:26,720
So that's not great.

137
00:20:26,720 --> 00:20:34,480
Well, that's a subjective statement, but it is another assumption built into the model.

138
00:20:34,480 --> 00:20:37,040
It's a simpler model.

139
00:20:37,040 --> 00:20:44,480
So we're just going to be estimating one parameter now, we're just going to be estimating alpha.

140
00:20:44,480 --> 00:20:53,600
So, and the way we do this is we basically just divide the whole equation by.

141
00:20:53,600 --> 00:20:56,240
So here n is labor.

142
00:20:56,240 --> 00:21:05,840
So if we just divide everything by N, the N over N is one, and, you know, one to the one minus alpha is one.

143
00:21:05,840 --> 00:21:12,080
So that n disappears and then you're just left with.

144
00:21:12,080 --> 00:21:20,160
KT divided by NT, to the alpha.

145
00:21:20,160 --> 00:21:27,520
And YT divided by NT.

146
00:21:27,520 --> 00:21:35,200
And then if you take the log of both sides, so this is called log linearizing.

147
00:21:35,200 --> 00:21:40,480
That way you have a linear equation that you can estimate.

148
00:21:40,480 --> 00:21:48,240
Then you have the log of output per labor.

149
00:21:48,240 --> 00:21:53,360
And that should equal a constant.

150
00:21:53,360 --> 00:22:02,800
Plus alpha times the log of capital per labor.

151
00:22:02,800 --> 00:22:19,920
So this is sort of our theoretical framework for estimating alpha.

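The derivation just walked through, written out: start from a Cobb-Douglas production function, divide through by labor, then take logs to get a linear equation whose slope is alpha.

```latex
Y_t = A\, K_t^{\alpha} N_t^{1-\alpha}
\quad\Longrightarrow\quad
\frac{Y_t}{N_t} = A \left(\frac{K_t}{N_t}\right)^{\alpha}
\quad\Longrightarrow\quad
\ln\!\left(\frac{Y_t}{N_t}\right) = \ln A + \alpha \ln\!\left(\frac{K_t}{N_t}\right)
```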
152
00:22:19,920 --> 00:22:23,440
Then, yeah, go ahead and just show you the rest of the economics.

153
00:22:23,440 --> 00:22:29,120
And so I'm going to open this in another tab so I can keep this spot.

154
00:22:29,120 --> 00:22:45,360
So the reason we want to estimate alpha.

155
00:22:45,360 --> 00:22:54,320
Yes, because so, you know, given this model, right, where we've got.

156
00:22:54,320 --> 00:23:00,880
Right, so production is now going to equal capital per labor to the alpha.

157
00:23:00,880 --> 00:23:05,520
So this K is capital per labor to the alpha.

158
00:23:05,520 --> 00:23:09,200
So output per labor equals capital per labor to the alpha.

159
00:23:09,200 --> 00:23:16,320
So.

160
00:23:16,320 --> 00:23:26,160
Capital should get paid its marginal product in a competitive market.

161
00:23:26,160 --> 00:23:29,120
So if you take.

162
00:23:29,120 --> 00:23:36,480
So the marginal product of capital is the derivative of the production function with respect to capital.

163
00:23:36,480 --> 00:23:40,160
So if you take the derivative with respect to capital.

164
00:23:40,160 --> 00:23:46,960
So that's going to be the derivative of K to the alpha with respect to K.

165
00:23:46,960 --> 00:23:48,720
So that's just going to be OK.

166
00:23:48,720 --> 00:23:52,240
The way you do this is you bring the exponent down in front.

167
00:23:52,240 --> 00:23:59,120
Alpha times K to the alpha minus one.

168
00:23:59,120 --> 00:24:10,080
And that will be the marginal product of capital, which will be the rate of return of capital in a competitive.

169
00:24:10,080 --> 00:24:12,400
Market.

170
00:24:12,400 --> 00:24:14,080
So.

171
00:24:14,080 --> 00:24:16,880
So this is what we're boiling it down to.

172
00:24:16,880 --> 00:24:27,040
So we're basically going to say, OK, the rate of return, r, is going to be alpha times K over L.

173
00:24:27,040 --> 00:24:32,080
Capital per labor, all raised to the alpha minus one.

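The marginal-product calculation in symbols: differentiate the production function with respect to capital, bringing the exponent down in front, and the result simplifies to a function of capital per labor.

```latex
r = \frac{\partial Y}{\partial K}
  = \frac{\partial}{\partial K}\left[ K^{\alpha} N^{1-\alpha} \right]
  = \alpha K^{\alpha-1} N^{1-\alpha}
  = \alpha \left(\frac{K}{N}\right)^{\alpha-1}
```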
174
00:24:32,080 --> 00:24:40,240
So that's the theoretical framework that we're going to use to estimate the rate of return.

175
00:24:40,240 --> 00:24:42,720
Here in Massachusetts.

176
00:24:42,720 --> 00:24:54,240
Because I wasn't satisfied with how we did it last week, so we're going to use a slightly simpler model and see what our estimate is.

177
00:24:54,240 --> 00:24:58,240
That's the theoretical framework.

178
00:24:58,240 --> 00:25:02,240
And so now we're going to.

179
00:25:02,240 --> 00:25:08,240
Estimate it with the data.

180
00:25:08,240 --> 00:25:12,240
So we can define our variables first.

181
00:25:12,240 --> 00:25:20,240
So right out of the gate, we're just defining Y as sales.

182
00:25:20,240 --> 00:25:28,240
And we're going to be working with the monthly series here as.

183
00:25:28,240 --> 00:25:34,240
As we'll see with.

184
00:25:34,240 --> 00:25:38,240
Well.

185
00:25:38,240 --> 00:25:44,240
We may not necessarily need to work with.

186
00:25:44,240 --> 00:25:50,240
Let's actually work with the daily series here and.

187
00:25:50,240 --> 00:25:54,240
And just kind of do something a little out of the box here.

188
00:25:54,240 --> 00:26:00,240
If you were going to estimate the competitive.

189
00:26:00,240 --> 00:26:06,240
return in Massachusetts. OK, so.

190
00:26:06,240 --> 00:26:10,240
Similar analysis as.

191
00:26:10,240 --> 00:26:18,240
Well.

192
00:26:18,240 --> 00:26:26,240
Well.

193
00:26:26,240 --> 00:26:40,240
Well, yes, let's go ahead and try it and we may just have to do some data cleaning along the way, but you know that's the way we do it here. We know we just have to sometimes.

194
00:26:40,240 --> 00:26:44,240
And do this so.

195
00:26:44,240 --> 00:26:48,240
Let's go ahead and define.

196
00:26:48,240 --> 00:26:50,240
Y.

197
00:26:50,240 --> 00:26:58,240
As just production: sales.

198
00:26:58,240 --> 00:27:04,240
Right, we're defining K as just the total number of.

199
00:27:04,240 --> 00:27:12,240
You know, plants that are tracked.

200
00:27:12,240 --> 00:27:27,240
And this case we were defining labor as hours worked, but now.

201
00:27:27,240 --> 00:27:36,240
Well, let's just use the total number of employees. I think we'll just

202
00:27:36,240 --> 00:27:42,240
introduce fewer abstractions, and we can use the daily series.

203
00:27:42,240 --> 00:27:58,240
So we'll just say labor is just the number of employees.

204
00:27:58,240 --> 00:28:09,240
This is the part that was giving me pause: we actually now need to exclude missing observations here, right? Because,

205
00:28:09,240 --> 00:28:18,240
You know, say we define our variables.

206
00:28:18,240 --> 00:28:36,240
Labor is fine, except the.

207
00:28:36,240 --> 00:28:41,240
Interesting.

208
00:28:41,240 --> 00:28:48,240
That's right. So I'm real.

209
00:28:48,240 --> 00:28:58,240
This spike here at the end of capital and labor are giving me quite the pause.

210
00:28:58,240 --> 00:29:08,240
And I also essentially wanted to drop this whole time period here where we're missing sales.

211
00:29:08,240 --> 00:29:12,240
So how are we going to go about doing this?

212
00:29:12,240 --> 00:29:15,240
Well,

213
00:29:15,240 --> 00:29:41,240
Let's find the periods where we're missing sales, right? So this is going to be Y where Y is missing.

214
00:29:41,240 --> 00:29:57,240
Let's let's missing sales.

215
00:29:57,240 --> 00:30:11,240
You don't necessarily want to just drop any days that are just.

216
00:30:11,240 --> 00:30:30,240
You know, the reason it looks like, you know, there's not really sales through.

217
00:30:30,240 --> 00:30:42,240
And so this is the period I was primarily worried about March 26, 2020 missing sales through May.

218
00:30:42,240 --> 00:31:05,240
2020 and then, you know, it looks like there's a scattering.

219
00:31:05,240 --> 00:31:16,240
Just for the sake of time here, I think we're just going to exclude all the days that are missing sales. However.

220
00:31:16,240 --> 00:31:21,240
That may not necessarily be the best practice.

221
00:31:21,240 --> 00:31:28,240
If it's closed, you can't eliminate that data. I mean, you'd be creating data where it should not exist, right?

222
00:31:28,240 --> 00:31:37,240
Can you say that one more time? Like if the store is closed, some places around here, they're definitely open Christmas day, Christmas Eve or whatever.

223
00:31:37,240 --> 00:31:45,240
But if something is truly closed, inserting the data would not be allowed, right?

224
00:31:45,240 --> 00:32:09,240
Well, it's more just it appears for that day, we're just missing an observation. So maybe like the office was closed and then they just entered the data in on the 26th or what have you.

225
00:32:09,240 --> 00:32:17,240
My main thing is I just don't want to include sort of these oddballs in the analysis.

226
00:32:17,240 --> 00:32:39,240
Because, like, when you're running a regression, if, you know, our output is zero and we have, you know, a certain number of plants and employees, well, that may sort of bias the model in one way or the other when the sales weren't in fact zero.

227
00:32:39,240 --> 00:32:44,240
It's sort of my worry.

228
00:32:44,240 --> 00:33:00,240
So I just thought, okay, to avoid introducing bias for days that are coded as zero, but they're not actually zero, then I was just going to ignore those days.

229
00:33:00,240 --> 00:33:17,240
So, you know, luckily, you know, we're given, you know, a little more than 1000 days here. So

230
00:33:17,240 --> 00:33:42,240
actually, I was just going to exclude, you know, these 97 days from the analysis and saying, okay, you know, these days aren't your typical day. So let's, let's just do our analysis on, you know, your typical day, so to speak.

231
00:33:42,240 --> 00:33:52,240
I hope that answers your question, Heather, but please chime in if you still have concerns.

232
00:33:52,240 --> 00:33:57,240
Nope, thank you.

233
00:33:57,240 --> 00:34:09,240
But it's just sort of a, you know, a step that I'm taking to try to remove outliers that may not be representative of the normal.

234
00:34:09,240 --> 00:34:14,240
So,

235
00:34:14,240 --> 00:34:17,240
so long story short,

236
00:34:17,240 --> 00:34:43,240
I'm just going to get these where we're not, you know, missing sales.

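The exclusion step described above can be a simple boolean filter: keep only the days where differenced sales are present and positive, treating zero-coded and missing days as unobserved. The series here is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales, with some zero-coded and missing days.
idx = pd.date_range("2020-03-20", periods=10, freq="D")
sales = pd.Series(
    [5.0, 6.0, 0.0, 0.0, np.nan, 0.0, 7.0, 5.5, 0.0, 6.5],
    index=idx,
)

# Keep only days with an observed, positive sales figure.
observed = sales[sales.notna() & (sales > 0)]
```

As discussed, this is a judgment call: it trades the measurement error of fake zeros for possible selection bias from dropping non-random days.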
237
00:34:43,240 --> 00:34:58,240
Oh,

238
00:34:58,240 --> 00:35:13,240
Keegan, I'm sorry, I see my question now. Okay, so when excluding data like this, like, so for example, when I was fitting my crystal structures to the model or whatever,

239
00:35:13,240 --> 00:35:24,240
like it didn't always fit right, but then there are points that I could, I could eliminate because the light beam did not actually hit there, even though the light beam was there.

240
00:35:24,240 --> 00:35:33,240
So it's like when you eliminate data in this case, it's like data you're supposed to have, but that's not there or data that's truly not there.

241
00:35:33,240 --> 00:35:40,240
It's a very theoretical thing, but I'm not sure that I'm maybe understanding it right.

242
00:35:40,240 --> 00:35:59,240
So, in fact, there's a, there's a whole sort of, so in econometrics, they'll do studies about, okay, you know, what happens if there is measurement error or, you know, what happens if you do exclude a subset of data?

243
00:35:59,240 --> 00:36:22,240
And what happens if the subset you exclude is non-random? And so long story short is, yes, it does introduce bias. And so, yes, if you, if you just exclude a segment of the data, like we're doing now, or if you were going to

244
00:36:22,240 --> 00:36:35,240
have measurement error. So if we're going to, it's basically like you've got the bias one way or the other. So you've either got

245
00:36:35,240 --> 00:36:57,240
measurement error, where these days are coded as zero, and they shouldn't be, or you've got selection bias potentially, where we're just selecting a subset of the data for analysis.

246
00:36:57,240 --> 00:37:08,240
And by doing so, we may be introducing bias because you're right, there may be something special about the days that we're excluding.

247
00:37:08,240 --> 00:37:21,240
So, thank you. Thank you. I think that's well said because it's like it tells you how you can interpret this data, because at this point, like as of yesterday, last week, I wasn't sure if the model was right because alpha and beta being completely

248
00:37:21,240 --> 00:37:27,240
out of range, like when something's completely out of range, usually that means the model is wrong.

249
00:37:27,240 --> 00:37:44,240
True, true. Or, well, we had a wide confidence interval and we had a small number of observations, so we couldn't really say with much confidence what the parameters were going to be.

250
00:37:44,240 --> 00:38:01,240
So, another potential we could do is we could potentially, instead of just excluding the missing sales, we could potentially just look at this, you know, just from the last year.

251
00:38:01,240 --> 00:38:17,240
But then again, you know, it's saying, okay, well, then you know we're excluding all the data from before last year. So, that's just a quick way to wash your hands of the problem of selection.

252
00:38:17,240 --> 00:38:22,240
I wish I could do that with everything in my life. Thank you so much. It sounds great.

253
00:38:22,240 --> 00:38:49,240
So, long story short, we're kind of stuck between a rock and a hard place where, and this is where I like to, you know, stress to people, you know, really like when you when you see someone present their data definitely ask these hard questions and push them on these because, you know, when you're doing any sort of data analytics, you run into these sort of

254
00:38:49,240 --> 00:39:08,240
conundrums and, you know, a lot of the times these solutions get done you know behind the scenes and they may not get talked about a lot, right. So like in our analysis, we've got this missing data gap.

255
00:39:08,240 --> 00:39:22,240
And, you know, that may not necessarily be, you know, readily apparent in whatever statistics we end up calculating.

256
00:39:22,240 --> 00:39:42,240
So, good point. And my only comment is just to power on; just sort of present your biases, just acknowledge them, be upfront about them.

257
00:39:42,240 --> 00:40:01,240
Emphasize them and then, you know, not necessarily but I'm under the belief that some statistics are better than no statistics. So we'll at least try to measure the alpha, so we can get the rate of return of capital.

258
00:40:01,240 --> 00:40:17,240
So we're going to hedge it that it may be just wildly, wildly inaccurate, but we're just sort of going to try to calculate it for a curiosity.

259
00:40:17,240 --> 00:40:22,240
Real quick here.

260
00:40:22,240 --> 00:40:50,240
I'm having a bit of trouble excluding these missing sales. So,

261
00:40:50,240 --> 00:41:05,240
let's see if we can't solve this real quick.

262
00:41:05,240 --> 00:41:25,240
Let's see if we can solve this.

263
00:41:25,240 --> 00:41:37,240
Okay, okay, this may not be the end of the world because I think maybe the plots just off. So,

264
00:41:37,240 --> 00:41:54,240
great. So this worked out okay. The plot is just confusing. Okay, so, long story short, we're just going to try this real quick: Y, K, L.

265
00:41:54,240 --> 00:42:09,240
So now we can actually define, right, output per labor. So that's Y per L.

266
00:42:09,240 --> 00:42:24,240
And then we're going to define output per capital.

267
00:42:24,240 --> 00:42:39,240
Like, exactly.

268
00:42:39,240 --> 00:42:57,240
So, we've got our fluctuation.

269
00:42:57,240 --> 00:43:02,240
Capital per labor is increasing.

270
00:43:02,240 --> 00:43:23,240
So that's, that's quite interesting.

271
00:43:23,240 --> 00:43:30,240
Next, we're going to be essentially trying to estimate alpha here.

272
00:43:30,240 --> 00:43:39,240
Right. And so remember, we have to log.

273
00:43:39,240 --> 00:43:47,240
So, here, we have to take a log of both sides.

274
00:43:47,240 --> 00:44:00,240
So,

275
00:44:00,240 --> 00:44:25,240
then we can find the log of capital per labor.

276
00:44:25,240 --> 00:44:40,240
That's right. And we just have to add a constant, right, because we have our constant term, which is technology. We've got a

277
00:44:40,240 --> 00:45:03,240
technology augmented production function. So, let's just add a constant here.

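The regression with a constant can be sketched as follows. The notebook presumably uses a statistics package, but an ordinary least-squares fit needs only NumPy; the data here is simulated with a made-up true alpha of 0.33, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-worker series: log(Y/N) = log(A) + alpha * log(K/N) + noise,
# with an illustrative true alpha of 0.33 and log(A) of 1.0.
log_k_per_n = rng.normal(2.0, 0.5, 500)
log_y_per_n = 1.0 + 0.33 * log_k_per_n + rng.normal(0.0, 0.05, 500)

# Regress log output per worker on a constant and log capital per worker;
# the intercept absorbs log technology, and the slope estimates alpha.
X = np.column_stack([np.ones_like(log_k_per_n), log_k_per_n])
(const_hat, alpha_hat), *_ = np.linalg.lstsq(X, log_y_per_n, rcond=None)
```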
278
00:45:03,240 --> 00:45:13,240
Oh, righty then. This should, I'm not sure why we're getting a bad syntax error, but

279
00:45:13,240 --> 00:45:33,240
we've got double equal signs. That would make sense. All right, looks like the regression was fit.

280
00:45:33,240 --> 00:45:58,240
But maybe it was not fit. So, it looks like something went wrong here.

281
00:45:58,240 --> 00:46:10,240
I bet you may have exactly, see here with, whenever you take the log,

282
00:46:10,240 --> 00:46:19,240
if you take a log of a negative, you can't take a log of a negative, right. So, we've got some missing values for

283
00:46:19,240 --> 00:46:36,240
y here. So, why is that?

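A quick illustration of why the logs go bad: the log of a negative comes out as NaN and the log of zero as negative infinity, so any differenced sales that dipped negative or were zero-coded turn into missing or infinite regression inputs.

```python
import numpy as np

# Suppress the runtime warnings so the bad values are returned silently.
with np.errstate(invalid="ignore", divide="ignore"):
    logged = np.log(np.array([10.0, -5.0, 0.0]))
# logged is [log(10), nan, -inf]
```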
284
00:46:36,240 --> 00:46:52,240
Okay, so that's right.

285
00:46:52,240 --> 00:47:07,240
So, yeah, that's because we're missing sales on October 15th because we have to do the difference.

286
00:47:07,240 --> 00:47:22,240
All right, so there's several ways we can do this here. So, we can either just exclude the first observation, or we can start at a specific point in time. And so,

287
00:47:22,240 --> 00:47:39,240
we could potentially do both. So, here I'm just going to exclude just the first observation because it looks like the first observation is missing for sales. So,

288
00:47:39,240 --> 00:47:54,240
may not have quite done the trick. Interesting.

289
00:47:54,240 --> 00:48:18,240
Okay.

290
00:48:18,240 --> 00:48:39,240
So, this is a mess. So, what I'm going to say is, okay, we're just going to just, whatever was going on in this period, October 15th to what it looks like about,

291
00:48:39,240 --> 00:49:01,240
you know, the 20th of November is not consistent. So, yet again, we're going to exclude values, which, as Heather pointed out, introduces bias, right, because there may have been valuable information in that time period.

292
00:49:01,240 --> 00:49:28,240
But because it's just so messy, we're going to have to restrict our data to the 20th or after. So, how are we going to do that?

293
00:49:28,240 --> 00:49:53,240
So, we basically want all values of y where the index is greater than or equal to a specific date.
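In pandas, that restriction is just a boolean mask on the index; the dates and variable names below are illustrative stand-ins, not the actual cutoff from the notebook:

```python
import pandas as pd

# Hypothetical daily series indexed by date.
y = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.to_datetime(["2018-11-19", "2018-11-20",
                          "2018-11-21", "2018-11-22"]),
)

# Keep all values of y where the index is on or after the cutoff.
cutoff = pd.to_datetime("2018-11-20")
y = y[y.index >= cutoff]
print(len(y))  # -> 3
```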

294
00:49:53,240 --> 00:50:02,240
So, I'm not going to promise this is going to work, but it looks like it may have.

295
00:50:02,240 --> 00:50:22,240
So, then we basically just want to repeat this for capital and labor.

296
00:50:22,240 --> 00:50:28,240
I don't have to restrict it to the first day anymore. Okay.

297
00:50:28,240 --> 00:50:30,240
Maybe now.

298
00:50:30,240 --> 00:50:44,240
But now again, you know, now I've gotten real, real uncertain about things. And in fact,

299
00:50:44,240 --> 00:51:08,240
we just estimated a negative value for alpha. So, something went wrong here.

300
00:51:08,240 --> 00:51:29,240
That may be for the best. So, let's see here. So, log y.

301
00:51:29,240 --> 00:51:40,240
So, our analysis may have gone awry.

302
00:51:40,240 --> 00:51:56,240
Unless we just have.

303
00:51:56,240 --> 00:52:14,240
For some reason.

304
00:52:14,240 --> 00:52:33,240
So, you know, this is where I kind of just want to say we may not want to just keep poking at the code here, and we may just not have successfully estimated alpha in this manner.

305
00:52:33,240 --> 00:52:36,240
And that may be okay.

306
00:52:36,240 --> 00:52:54,240
Because, you know, last week we saw it. There's one more thing I'm going to try, but last week we saw that if we estimated this with two parameters, alpha and beta, we didn't have constant returns to scale.

307
00:52:54,240 --> 00:52:58,240
We had significantly decreasing returns to scale.

308
00:52:58,240 --> 00:53:15,240
So, it may be that this assumption, you know, this assumed production function with constant returns to scale, is not adequate for representing the market.

309
00:53:15,240 --> 00:53:26,240
Before we give up, the last thing I wanted to try was, okay, let's just, let's just try doing this.

310
00:53:26,240 --> 00:53:35,240
You know, like, let's just try after May of 2020.

311
00:53:35,240 --> 00:53:40,240
So, let's just start in June of 2020.

312
00:53:40,240 --> 00:53:54,240
And just see if, if we start at this time, the data is just a bit more,

313
00:53:54,240 --> 00:54:02,240
you know, free of outliers. No promises. So, let's just see.

314
00:54:02,240 --> 00:54:11,240
And let's also just say.

315
00:54:11,240 --> 00:54:24,240
We'll just try with this before we.

316
00:54:24,240 --> 00:54:26,240
And this is.

317
00:54:26,240 --> 00:54:44,240
Yeah, like I said, I think this analysis is just going to need to get tossed out because, basically, once again, we get an incredibly low R-squared. So we're not really explaining much of the variation.

318
00:54:44,240 --> 00:55:00,240
Not really a significant coefficient here on alpha. It's close to one, which is not really the value that you would expect for alpha.

319
00:55:00,240 --> 00:55:15,240
So, I hate to have just run us down this rabbit hole here, but sometimes this is constructive.

320
00:55:15,240 --> 00:55:32,240
To see, you know, we attempted to estimate this model here, but I don't think we successfully did.

321
00:55:32,240 --> 00:55:48,240
And so that's okay. And just to show you essentially what we estimated last week, and then why this may have broken down this week.

322
00:55:48,240 --> 00:55:52,240
So last week we were looking at the monthlies.

323
00:55:52,240 --> 00:56:04,240
And the other thing we could potentially try is, oh, you know, what happens if you measure labor in terms of, you know, hours worked.

324
00:56:04,240 --> 00:56:09,240
And so, you know, last week we looked at labor in terms of hours worked.

325
00:56:09,240 --> 00:56:17,240
You know, on a monthly level.

326
00:56:17,240 --> 00:56:22,240
And so here.

327
00:56:22,240 --> 00:56:34,240
We're estimating this same function, except we're letting one minus alpha be its own parameter, beta.

328
00:56:34,240 --> 00:56:51,240
And so then we're just log-linearizing and saying log y equals log A plus alpha log k plus beta log n.
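That two-parameter version can be sketched as an ordinary OLS regression, and the constant-returns hypothesis alpha + beta = 1 can be tested directly. Everything below is simulated for illustration (the true alpha and beta are set near the estimates discussed), not the actual Massachusetts data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated monthly data: y = output, k = plants, n = hours worked.
rng = np.random.default_rng(0)
k = rng.uniform(50, 100, 24)
n = rng.uniform(200, 400, 24)
y = 2.0 * k**0.2 * n**0.45 * rng.lognormal(0.0, 0.05, 24)

df = pd.DataFrame({"log_y": np.log(y), "log_k": np.log(k), "log_n": np.log(n)})

# log y = log A + alpha * log k + beta * log n
model = smf.ols("log_y ~ log_k + log_n", data=df).fit()
alpha, beta = model.params["log_k"], model.params["log_n"]

# Constant returns to scale requires alpha + beta = 1.
crs_test = model.t_test("log_k + log_n = 1")
print(alpha, beta, crs_test.pvalue)
```

If the test rejects alpha + beta = 1 from below, that's the decreasing-returns result described here.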

329
00:56:51,240 --> 00:57:01,240
What we find when we do that is

330
00:57:01,240 --> 00:57:11,240
Yes, an alpha of around point two, a little less than point two, and then a beta of around point four five.

331
00:57:11,240 --> 00:57:18,240
So if you add those together, you know, that's less than point six five.

332
00:57:18,240 --> 00:57:29,240
And so they would need to sum to one, you know, for there to be constant returns to scale here.

333
00:57:29,240 --> 00:57:44,240
Right. Because that's what this model is. Right. So if we assume that this is alpha, then, you know, one minus alpha, those have to add to one.

334
00:57:44,240 --> 00:57:54,240
And then in this case, they don't. So what that means is we've got decreasing returns to scale.

335
00:57:54,240 --> 00:58:07,240
So that would violate the assumption of constant returns.

336
00:58:07,240 --> 00:58:12,240
Can you say that one more time, please?

337
00:58:12,240 --> 00:58:19,240
Oh, no, I just said thank you. That was just really helpful, to see what you guys did the last time.

338
00:58:19,240 --> 00:58:35,240
Yes.

339
00:58:35,240 --> 00:58:51,240
The derivative would be different.

340
00:58:51,240 --> 00:59:08,240
Well, maybe that was the lesson learned today: things don't always go as expected, because I was essentially expecting this to pan out and then to compare the rate of return to what we estimated last week.

341
00:59:08,240 --> 00:59:15,240
And so

342
00:59:15,240 --> 00:59:28,240
Last week.

343
00:59:28,240 --> 00:59:32,240
Just in the last 12 months.

344
00:59:32,240 --> 00:59:51,240
Last week, we tried to estimate the rate of return, you know, essentially per plant. And so this would actually be the cost per plant, or, you know, what the investor would expect to get paid per plant.
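As a sanity check on the arithmetic: under a Cobb-Douglas production function, capital earns its marginal product, alpha times Y over K, so a back-of-the-envelope return per plant can be computed like this (all numbers purely illustrative, not the meetup's estimates):

```python
# Illustrative numbers only.
alpha = 0.2                  # estimated output elasticity of capital
monthly_sales = 1_000_000.0  # dollars of output (Y)
plants = 2_000               # number of plants (capital proxy, K)

# Under Cobb-Douglas, capital's share of output is alpha, so the
# implied payment per unit of capital is alpha * Y / K.
return_per_plant = alpha * monthly_sales / plants
print(return_per_plant)  # -> 100.0
```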

345
00:59:51,240 --> 01:00:02,240
And this was with our first model. And so I was hoping to estimate it again with this second model and compare the two.

346
01:00:02,240 --> 01:00:05,240
But we were unable to.

347
01:00:05,240 --> 01:00:22,240
So, basically, you know, we've tried a couple of different models now, and it seems that the only way we're able to estimate this is with monthly data.

348
01:00:22,240 --> 01:00:36,240
With, you know, labor proxied by hours worked and capital proxied by the number of plants.

349
01:00:36,240 --> 01:00:55,240
And, you know, I would put a serious hedge on these numbers, but this is the best we could do, or the best I can do, with the data given. So

350
01:00:55,240 --> 01:00:59,240
So I think

351
01:00:59,240 --> 01:01:12,240
I think I'm going to think about this over the next week, because I was wanting to tidy things up today and I think I essentially just uncovered more loose ends.

352
01:01:12,240 --> 01:01:15,240
So,

353
01:01:15,240 --> 01:01:30,240
So perhaps for next week we can finally start to really hammer this down, or turn our attention to California or another state.

354
01:01:30,240 --> 01:01:50,240
But that's another thing: I'm also going to ask a couple of, you know, actual growers, what is your cost per plant? You know, is your cost per plant around $100 or so? Does that sound reasonable?

355
01:01:50,240 --> 01:02:07,240
Or, you know, is $500 per plant out of the ballpark? So that's another thing: I can start to actually try to get some anecdotal evidence to see if it could support some of our empirical evidence.

356
01:02:07,240 --> 01:02:10,240
So,

357
01:02:10,240 --> 01:02:14,240
So,

358
01:02:14,240 --> 01:02:29,240
So I'm completely dissatisfied with the analysis for today, but that's probably for the best, right, because it shows, you know, what needs to be touched up and re-examined next week.

359
01:02:29,240 --> 01:02:37,240
But if any of you are interested, there's a lot more macroeconomics here to be had.

360
01:02:37,240 --> 01:02:39,240
And then,

361
01:02:39,240 --> 01:02:49,240
for next week, there's some interesting California data. So, you know, there's, you know, sales by county.

362
01:02:49,240 --> 01:02:57,240
So we can start looking at, you know, an economic analysis of California.

363
01:02:57,240 --> 01:03:12,240
So we don't have as many data points here, so it may be tough to make any statistical determinations of any sort, but there's data to be had.

364
01:03:12,240 --> 01:03:18,240
So, you know, until next week, I'm sure,

365
01:03:18,240 --> 01:03:29,240
I may have left some of you unsettled with some of the analysis today. So if you've got any questions, you know, feel free to reach out and

366
01:03:29,240 --> 01:03:39,240
we can continue to, you know, think about and puzzle out this data and

367
01:03:39,240 --> 01:03:51,240
you know, if any of you have any input, I'm definitely always open to new statistics that you may be interested in having talked about.

368
01:03:51,240 --> 01:04:00,240
So, but, you know, it's a look behind the curtain, right, because it's,

369
01:04:00,240 --> 01:04:19,240
you know, things don't always go as planned. And I think today was a good example of that. So, you know, you saw firsthand where we tried to go off-road and calculate the daily rate of return and

370
01:04:19,240 --> 01:04:37,240
it wasn't fruitful. And I do kind of want to apologize for that, but you can't always force an outcome, right? We tried dropping the missing observations.

371
01:04:37,240 --> 01:04:51,240
And, you know, there are still more things that can be tried. So by all means, I would recommend that you repeat the analysis: instead of using the total number of employees, use the total number of hours worked. So you can estimate that.

372
01:04:51,240 --> 01:05:08,240
Yeah, we'll go ahead and jump off, but long story short, thank you all for attending and I hope you learned something and stay in touch and we'll pick up next week.

373
01:05:08,240 --> 01:05:26,240
All right, thanks again for coming. Thanks for attending the Cannabis Data Science Meetup Group, and until next week, keep your nose to the grindstone and keep having fun.

