1
00:00:00,000 --> 00:00:09,600
Welcome to the Azure Security Podcast, where we discuss topics relating to security, privacy,

2
00:00:09,600 --> 00:00:13,280
reliability and compliance on the Microsoft Cloud Platform.

3
00:00:13,280 --> 00:00:17,440
Hey everybody, welcome to Episode 91.

4
00:00:17,440 --> 00:00:22,280
This week is just myself, Michael, and our guest this week is Rigel Carlson, who's here

5
00:00:22,280 --> 00:00:25,820
to talk to us about Chaos Studio.

6
00:00:25,820 --> 00:00:30,680
Before we get to our guest, let me talk briefly about a couple of news items.

7
00:00:30,680 --> 00:00:35,800
One, we have just issued a document from the Microsoft Threat Intelligence team called

8
00:00:35,800 --> 00:00:40,400
Midnight Blizzard Guidance for Responders on Nation-State Attack.

9
00:00:40,400 --> 00:00:41,720
Very much worth reading.

10
00:00:41,720 --> 00:00:48,080
This is basically some more attacks coming out from a threat actor we often refer to

11
00:00:48,080 --> 00:00:49,080
as Nobelium.

12
00:00:49,080 --> 00:00:51,120
So please do go take a look at that.

13
00:00:51,120 --> 00:00:53,520
I will put a link in the show notes.

14
00:00:53,520 --> 00:00:56,040
The other one is an interesting one.

15
00:00:56,040 --> 00:01:01,040
Over the last few weeks or so, I've been doing quite a bit of development on Always Encrypted,

16
00:01:01,040 --> 00:01:04,400
which is a technology in Azure SQL Database and SQL Server.

17
00:01:04,400 --> 00:01:08,360
I wrote some sample code and was experimenting and playing around, what have you.

18
00:01:08,360 --> 00:01:10,160
The first query took 15 seconds.

19
00:01:10,160 --> 00:01:11,600
I'm like, oh, that's not good.

20
00:01:11,600 --> 00:01:16,240
So I did a bit of digging around, finding why it's taking 15 seconds.

21
00:01:16,240 --> 00:01:19,000
Actually, I thought the problem was with Always Encrypted.

22
00:01:19,000 --> 00:01:20,600
It turns out it's not.

23
00:01:20,600 --> 00:01:26,760
The problem was actually with the way I acquired the Azure credentials to use Key Vault for

24
00:01:26,760 --> 00:01:27,760
the key storage.

25
00:01:27,760 --> 00:01:35,440
Basically, the way the code works inside of the Microsoft.Data.SqlClient library is it

26
00:01:35,440 --> 00:01:40,640
tries to hold off doing all the work it needs to do until it actually has to do it.

27
00:01:40,640 --> 00:01:43,200
So it's quite lazy in that regard.

28
00:01:43,200 --> 00:01:47,600
And so when I go to do the first execute, like execute the first query, it basically

29
00:01:47,600 --> 00:01:51,640
does everything there, including going to SQL Server or SQL Database, pulling down a

30
00:01:51,640 --> 00:01:55,280
stored procedure to find out what columns are encrypted, then going to Key Vault.

31
00:01:55,280 --> 00:01:57,080
All that stuff happens.

32
00:01:57,080 --> 00:01:58,680
And so that's why it takes a long time.

33
00:01:58,680 --> 00:02:05,600
And the other one is that it calls DefaultAzureCredential.

34
00:02:05,600 --> 00:02:08,600
And that actually goes through a whole bunch of different credential providers to find

35
00:02:08,600 --> 00:02:10,800
out which one to use.

36
00:02:10,800 --> 00:02:12,560
And that can take a lot of time.

37
00:02:12,560 --> 00:02:15,840
So I put a couple of little tweets out there about my findings.

38
00:02:15,840 --> 00:02:20,960
I'm not saying don't use DefaultAzureCredential, but just be aware of the implications of using

39
00:02:20,960 --> 00:02:21,960
it.
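Editor's sketch of why a credential chain can be slow on first use. This is a toy model in plain Python, not the Azure Identity library: the source names, the timings, and the `acquire` helper are all assumptions made up for illustration. It only shows that probing several sources in order costs the sum of every failed probe before the first success.

```python
from typing import Callable, List, Optional, Tuple

# Each hypothetical source returns (token or None, simulated seconds spent probing).
Source = Callable[[], Tuple[Optional[str], float]]


def unavailable(seconds: float) -> Source:
    """A source that is not configured here and costs `seconds` to find that out."""
    return lambda: (None, seconds)


def available(token: str, seconds: float) -> Source:
    """A source that yields a token after `seconds`."""
    return lambda: (token, seconds)


def acquire(chain: List[Source]) -> Tuple[str, float]:
    """Try each source in order; return the first token and the total simulated cost."""
    spent = 0.0
    for source in chain:
        token, seconds = source()
        spent += seconds
        if token is not None:
            return token, spent
    raise RuntimeError("no credential source produced a token")


# Made-up timings: two sources that fail slowly, then one that works.
probing_chain = [unavailable(5.0), unavailable(5.0), available("token", 0.5)]
explicit_chain = [available("token", 0.5)]

print(acquire(probing_chain))   # ('token', 10.5)
print(acquire(explicit_chain))  # ('token', 0.5)
```

In this model, naming the one source you expect to work removes the cost of the failed probes, which is the trade-off described above.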

40
00:02:21,960 --> 00:02:22,960
All right.

41
00:02:22,960 --> 00:02:26,160
So with that, let's now turn our attention to our guest.

42
00:02:26,160 --> 00:02:31,640
This week, as I mentioned, we have Rigel Carlson, who's here to talk to us about Chaos Studio.

43
00:02:31,640 --> 00:02:33,600
Rigel, hey, welcome to the podcast.

44
00:02:33,600 --> 00:02:37,080
Would you like to spend a moment and just introduce yourself to our listeners?

45
00:02:37,080 --> 00:02:38,080
Thanks for having me, Michael.

46
00:02:38,080 --> 00:02:39,080
I really appreciate it.

47
00:02:39,080 --> 00:02:40,400
So I'm Rigel.

48
00:02:40,400 --> 00:02:46,600
I am a product manager here at Microsoft working on Azure Chaos Studio.

49
00:02:46,600 --> 00:02:49,040
I've been at Microsoft about four years.

50
00:02:49,040 --> 00:02:54,180
I worked as well on the Windows deployment and update stack.

51
00:02:54,180 --> 00:03:01,120
So lots of fun stuff, both in the deployment stack and in Azure Chaos Studio.

52
00:03:01,120 --> 00:03:04,800
Really excited to be here talking with you today about Chaos Engineering.

53
00:03:04,800 --> 00:03:08,200
Well, first of all, congratulations on getting it out the door.

54
00:03:08,200 --> 00:03:11,520
It's obviously a huge milestone, even though it's taken you guys a little bit of a while

55
00:03:11,520 --> 00:03:15,080
to get it out of preview.

56
00:03:15,080 --> 00:03:16,080
How long has it been in preview?

57
00:03:16,080 --> 00:03:17,800
It's been a while, right?

58
00:03:17,800 --> 00:03:20,880
It was several years.

59
00:03:20,880 --> 00:03:28,240
So Chaos Studio has been around since about 2019.

60
00:03:28,240 --> 00:03:33,800
Chaos Engineering as a whole has been around for a much longer time.

61
00:03:33,800 --> 00:03:42,640
I believe it was popularized in the software world in 2011 when Netflix introduced their

62
00:03:42,640 --> 00:03:45,200
kind of internal Chaos Monkey tool.

63
00:03:45,200 --> 00:03:50,040
So when you say Chaos Engineering, that's often what a lot of folks will think of is

64
00:03:50,040 --> 00:03:59,080
this Chaos Monkey tool that Netflix built that basically just, it was pretty simple

65
00:03:59,080 --> 00:04:08,360
at first and it went off and killed instances of virtual machines running in production.

66
00:04:08,360 --> 00:04:14,440
This was, I believe, around when they were migrating to the cloud for the first time

67
00:04:14,440 --> 00:04:21,360
and they were hoping to test their resilience a little more effectively and started saying,

68
00:04:21,360 --> 00:04:27,880
hey, what if we just go and kill some instances of VMs in production and watch the results?

69
00:04:27,880 --> 00:04:34,080
It's become a much more common practice since then with cloud providers like us, like Azure,

70
00:04:34,080 --> 00:04:41,840
offering Chaos as a service and to a much greater extent than just kind of killing individual

71
00:04:41,840 --> 00:04:47,440
VMs or compute instances and startups entering the space as well.

72
00:04:47,440 --> 00:04:52,160
Now obviously from a security standpoint, I mean, this is a security podcast with a

73
00:04:52,160 --> 00:04:56,400
major focus on our cloud platforms.

74
00:04:56,400 --> 00:05:03,240
But if you look at Chaos Studio through a security lens, my guess, and correct me if

75
00:05:03,240 --> 00:05:09,080
I'm wrong here, but primarily we're focusing on availability and reliability and uptime

76
00:05:09,080 --> 00:05:10,880
and resilience and so on.

77
00:05:10,880 --> 00:05:18,720
So if you look at it from a security standpoint, you've got the classic CIA trifecta, confidentiality,

78
00:05:18,720 --> 00:05:20,480
integrity and availability.

79
00:05:20,480 --> 00:05:24,440
It sounds to me like Chaos Studio is really on the availability side.

80
00:05:24,440 --> 00:05:29,400
And if you're building, say, threat models or designing systems using STRIDE, which is

81
00:05:29,400 --> 00:05:33,920
spoofing, tampering, repudiation, information disclosure, denial of service and elevation

82
00:05:33,920 --> 00:05:37,400
of privilege, it's the D, denial of service.

83
00:05:37,400 --> 00:05:39,280
Is that a fair comment?

84
00:05:39,280 --> 00:05:43,960
You're really focusing on the availability and reliability and mitigating denial of service

85
00:05:43,960 --> 00:05:44,960
issues?

86
00:05:44,960 --> 00:05:52,100
I think that's a good comparison and a good assessment that we're focusing on those issues

87
00:05:52,100 --> 00:06:01,800
where systems might not be available or they may be behaving in strange ways.

88
00:06:01,800 --> 00:06:10,160
I come from a systems engineering background before coming into the software world.

89
00:06:10,160 --> 00:06:20,920
And we think about how systems, these complex systems that the world is made up of, as systems

90
00:06:20,920 --> 00:06:28,480
get more and more complex, there's not one way that you can describe them.

91
00:06:28,480 --> 00:06:34,520
There's tons of relationships and feedback loops that make up these complex systems,

92
00:06:34,520 --> 00:06:43,800
whether it's societal or security systems or cloud reliability and cloud infrastructure.

93
00:06:43,800 --> 00:06:51,600
And they exhibit emergent behavior, which is the things you can't necessarily plan for,

94
00:06:51,600 --> 00:06:55,080
the behavior you can't plan for.

95
00:06:55,080 --> 00:07:00,540
So chaos testing, chaos engineering helps with some of those scenarios, whether it's

96
00:07:00,540 --> 00:07:07,920
in a security context or just sort of a cloud resilience context.

97
00:07:07,920 --> 00:07:17,400
I think focusing on availability, focusing on those denial of service scenarios is a

98
00:07:17,400 --> 00:07:21,000
good place to draw the parallel.

99
00:07:21,000 --> 00:07:31,000
And I think within chaos engineering as a whole, there are scenarios we focus on that

100
00:07:31,000 --> 00:07:38,680
Chaos Studio can help with, like, okay, what happens if I'm experiencing a whole lot of

101
00:07:38,680 --> 00:07:41,760
resource pressure on my virtual machines?

102
00:07:41,760 --> 00:07:52,900
Or if a network connection is knocked out to certain IPs or certain ports, do I know

103
00:07:52,900 --> 00:08:02,040
what's going to happen to the rest of my system if that happens to occur?

104
00:08:02,040 --> 00:08:11,640
Also, in thinking about this podcast episode, I was looking a little into

105
00:08:11,640 --> 00:08:15,180
the security chaos engineering discipline.

106
00:08:15,180 --> 00:08:23,220
I think there are a lot of parallels between the non-security chaos engineering and security

107
00:08:23,220 --> 00:08:25,240
chaos engineering.

108
00:08:25,240 --> 00:08:31,280
There's an O'Reilly book on security chaos engineering by Kelly Shortridge and Aaron

109
00:08:31,280 --> 00:08:34,080
Rinehart that I was looking at a little bit.

110
00:08:34,080 --> 00:08:43,160
And one thing that stuck out to me was a quote about how cybersecurity must embrace the reality

111
00:08:43,160 --> 00:08:46,600
that failure will happen.

112
00:08:46,600 --> 00:08:52,400
And it kind of goes on to talk about how people are going to click on the wrong things and

113
00:08:52,400 --> 00:08:59,360
security mitigations will be accidentally disabled, things will break and are breaking

114
00:08:59,360 --> 00:09:00,920
all the time.

115
00:09:00,920 --> 00:09:08,680
And that definitely aligns with how we, working on Chaos Studio, think about the

116
00:09:08,680 --> 00:09:13,320
world and recommend that folks test.

117
00:09:13,320 --> 00:09:19,640
I can also go a little into how our service works.

118
00:09:19,640 --> 00:09:27,800
So I mentioned we started kind of back in 2019-ish and we were in public preview for

119
00:09:27,800 --> 00:09:29,680
a few years.

120
00:09:29,680 --> 00:09:38,240
And just recently at Microsoft Ignite in November, we brought Chaos Studio into general availability.

121
00:09:38,240 --> 00:09:45,040
But we've had quite a few customers using us in the public preview phase.

122
00:09:45,040 --> 00:09:55,840
So Azure Chaos Studio is a managed Azure service that works to measure and understand and build

123
00:09:55,840 --> 00:10:01,220
customers' resilience to different real world outages.

124
00:10:01,220 --> 00:10:09,520
So like I talked about with chaos engineering as a whole, kind of being a way to test resilience

125
00:10:09,520 --> 00:10:14,160
by breaking things with fault injection.

126
00:10:14,160 --> 00:10:24,680
Chaos Studio lets you do that for Azure services in a more integrated way by providing those

127
00:10:24,680 --> 00:10:34,820
connections to virtual machines, to Azure Kubernetes Service, to Key Vault, and providing various

128
00:10:34,820 --> 00:10:42,080
faults that can mess with those services or mess with kind of your configuration of those

129
00:10:42,080 --> 00:10:43,340
services.

130
00:10:43,340 --> 00:10:46,760
We have a couple different ways that that can happen.

131
00:10:46,760 --> 00:10:54,000
So we have faults that are pretty straightforward and just talking to another service, making

132
00:10:54,000 --> 00:10:58,200
some API calls like let's take virtual machines as an example.

133
00:10:58,200 --> 00:11:03,080
If you're running a whole bunch of compute in Azure using virtual machines, virtual machine

134
00:11:03,080 --> 00:11:12,640
scale sets, you may not have tested how your application and your system as a whole behaves

135
00:11:12,640 --> 00:11:19,680
when some subset of those virtual machines go down for some reason.

136
00:11:19,680 --> 00:11:29,080
Chaos Studio can help you do that by giving you the tools to set up an experiment, select

137
00:11:29,080 --> 00:11:35,360
and onboard all of those virtual machines that you might want to test and abruptly shut

138
00:11:35,360 --> 00:11:37,600
them down.

139
00:11:37,600 --> 00:11:40,640
Now you may be thinking, you know, okay, just shutting down VMs.

140
00:11:40,640 --> 00:11:44,300
I can go into Azure portal and do that myself.

141
00:11:44,300 --> 00:11:53,720
The value of Chaos Studio comes into play by kind of orchestrating that scenario and

142
00:11:53,720 --> 00:11:55,960
that fault with other faults.

143
00:11:55,960 --> 00:12:03,160
So you may want to do that in sequence or in parallel with other actions happening.

144
00:12:03,160 --> 00:12:08,360
Maybe I want to know what's happening when all of the virtual machines in a certain zone

145
00:12:08,360 --> 00:12:11,980
are taken out and they're no longer available.

146
00:12:11,980 --> 00:12:18,560
And I also, you know, in parallel I see a whole bunch of resource pressure on, you know,

147
00:12:18,560 --> 00:12:24,000
CPU or memory pressure on some other subset of my compute.

148
00:12:24,000 --> 00:12:31,560
And maybe also my, you know, Cosmos DB account is failing over between regions.

149
00:12:31,560 --> 00:12:42,280
So it's building up those more complex failure scenarios that is where Chaos Engineering

150
00:12:42,280 --> 00:12:48,720
kind of comes to the forefront and where Chaos Studio can really help.
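Editor's sketch of the "in sequence or in parallel" idea. This is a plain Python model, not the Chaos Studio experiment format: the `Fault` and `Step` classes, the fault labels, and the durations are assumptions for illustration. It treats an experiment as steps that run one after another, each holding branches that run side by side.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fault:
    name: str      # illustrative label, not a real fault identifier
    minutes: int   # how long the fault stays active


@dataclass
class Step:
    # Each branch is a list of faults run one after another;
    # all branches in a step start together.
    branches: List[List[Fault]]


def step_minutes(step: Step) -> int:
    """A step lasts as long as its slowest branch."""
    return max((sum(f.minutes for f in branch) for branch in step.branches), default=0)


def experiment_minutes(steps: List[Step]) -> int:
    """Steps run in sequence, so their durations add up."""
    return sum(step_minutes(step) for step in steps)


experiment = [
    Step(branches=[
        [Fault("shut down VMs in one zone", 10)],
        [Fault("CPU pressure on other VMs", 10)],
        [Fault("Cosmos DB regional failover", 5)],
    ]),
    Step(branches=[[Fault("network latency", 5)]]),
]

print(experiment_minutes(experiment))  # 15
```

The first step combines the three failures mentioned above at the same time; the second step is a made-up follow-on fault that only starts once the first step has finished.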

151
00:12:48,720 --> 00:12:54,640
So I talked a little about those service direct faults where we're talking directly to other

152
00:12:54,640 --> 00:12:56,140
Azure services.

153
00:12:56,140 --> 00:13:02,760
We have a Chaos Agent, which is, you know, a small piece of software that you can onboard

154
00:13:02,760 --> 00:13:11,700
to virtual machines and cause, you know, other issues within the virtual machine like that

155
00:13:11,700 --> 00:13:16,720
resource pressure or network disruption, network latency.

156
00:13:16,720 --> 00:13:23,720
Those can all be really important for just resilience scenarios or even security scenarios.

157
00:13:23,720 --> 00:13:26,560
I know you mentioned Key Vault earlier.

158
00:13:26,560 --> 00:13:36,200
You know, you were having some issues testing out some encryption with Key Vault infrastructure.

159
00:13:36,200 --> 00:13:43,480
We've had a lot of customers use some Key Vault faults that basically deny access for

160
00:13:43,480 --> 00:13:50,360
a certain period of time to Key Vault or, you know, see what happens when you go ahead

161
00:13:50,360 --> 00:13:58,720
and update certificates or lose access to certain Key Vault instances.

162
00:13:58,720 --> 00:14:03,120
So definitely something that Chaos Studio can help with.

163
00:14:03,120 --> 00:14:11,380
And then internally, we also have a couple other methods of fault injection.

164
00:14:11,380 --> 00:14:19,780
We do perform chaos testing internally on some of the Azure infrastructure.

165
00:14:19,780 --> 00:14:28,380
So we have some teams within Azure that, you know, work with us on and use our tooling

166
00:14:28,380 --> 00:14:34,920
to test, you know, what happens if this, you know, Azure infrastructure starts experiencing

167
00:14:34,920 --> 00:14:38,920
issues and are we able to deal with that from a resilience point of view.

168
00:14:38,920 --> 00:14:44,560
I know I went off on a bit of a tangent there, but I wanted to get a few of our fault types

169
00:14:44,560 --> 00:14:45,560
covered.

170
00:14:45,560 --> 00:14:50,160
You said about basically playing around with certificates, like rotating a certificate

171
00:14:50,160 --> 00:14:51,160
out.

172
00:14:51,160 --> 00:14:52,160
Can you do that?

173
00:14:52,160 --> 00:15:01,240
So the Key Vault faults that we support are, yeah, so we have Key Vault access denial.

174
00:15:01,240 --> 00:15:07,400
So basically blocking all of the network access to a certain Key Vault for a period of time.

175
00:15:07,400 --> 00:15:14,380
There's disabling a certificate for a specified duration and then re-enabling it, incrementing

176
00:15:14,380 --> 00:15:20,040
a certificate version or just generally updating a certificate policy.

177
00:15:20,040 --> 00:15:24,720
That's what we support for Key Vault and that may be, you know, may be useful for security

178
00:15:24,720 --> 00:15:25,720
scenarios.
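Editor's sketch of a fault that is applied for a while and then reverted, like disabling a certificate and re-enabling it. This is plain Python over a dictionary, not a Key Vault or Chaos Studio API: the `temporarily_disabled` helper, the `frontend-tls` name, and the `app_can_start` check are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_disabled(cert: dict):
    """Disable the certificate for the body of the `with` block, then restore it."""
    previous = cert["enabled"]
    cert["enabled"] = False
    try:
        yield cert
    finally:
        # Restore even if the code under test raises, so the fault never outlives the test.
        cert["enabled"] = previous


def app_can_start(cert: dict) -> bool:
    """Stand-in for whatever the application does with the certificate."""
    return cert["enabled"]


cert = {"name": "frontend-tls", "enabled": True}
observed = []

with temporarily_disabled(cert):
    observed.append(app_can_start(cert))  # behaviour while the fault is active
observed.append(app_can_start(cert))      # behaviour after the fault is reverted

print(observed)  # [False, True]
```

The `finally` clause is the design point: a timed fault should always clean up after itself, whatever happens while it is active.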

179
00:15:25,720 --> 00:15:30,800
Yeah, yeah, because, you know, certificates can get rolled underneath you.

180
00:15:30,800 --> 00:15:32,440
So a couple of questions.

181
00:15:32,440 --> 00:15:36,440
First of all, if someone were to use Chaos Studio, obviously it's going to start causing

182
00:15:36,440 --> 00:15:39,160
all sorts of havoc in their environment.

183
00:15:39,160 --> 00:15:43,640
Does that mean that Azure needs to be aware, like someone within Microsoft or the Azure

184
00:15:43,640 --> 00:15:48,480
infrastructure or personnel needs to know that you're using this when all of a sudden,

185
00:15:48,480 --> 00:15:52,000
you know, alerts start going off and things start failing?

186
00:15:52,000 --> 00:15:55,040
Or do you not need to do anything special if you're going to start using this?

187
00:15:55,040 --> 00:15:56,480
That's a great question.

188
00:15:56,480 --> 00:16:00,240
Yeah, so nothing special is needed.

189
00:16:00,240 --> 00:16:07,120
The nice thing about Chaos Studio and, you know, one of the principles that we built

190
00:16:07,120 --> 00:16:19,080
it on was, you know, giving, providing customers with the tools to do this controlled chaos,

191
00:16:19,080 --> 00:16:23,920
especially within a customer perspective, we're not necessarily taking that approach

192
00:16:23,920 --> 00:16:31,440
that I mentioned with, you know, Netflix's initial foray into chaos engineering where

193
00:16:31,440 --> 00:16:36,440
they were just going off and shutting off random instances in production.

194
00:16:36,440 --> 00:16:43,880
We take a little more controlled approach in that customers need to, you know, come

195
00:16:43,880 --> 00:16:51,400
to Azure, come to Chaos Studio, they need to explicitly onboard the resources that

196
00:16:51,400 --> 00:16:52,960
they want to affect.

197
00:16:52,960 --> 00:17:01,480
So whether that's virtual machines or a Cosmos DB account or their, you know, Key Vault resource,

198
00:17:01,480 --> 00:17:07,760
a customer does need to explicitly onboard all of those resources into Chaos Studio.

199
00:17:07,760 --> 00:17:16,440
They also need to have the permissions to perform certain actions against those resources.

200
00:17:16,440 --> 00:17:24,600
You know, we are built around Azure Resource Manager, the role-based access control

201
00:17:24,600 --> 00:17:33,000
model that, you know, folks are familiar with within Azure and everything goes through that

202
00:17:33,000 --> 00:17:34,440
RBAC model.

203
00:17:34,440 --> 00:17:41,040
That means we're not doing this random chaos, so you do need to be a little more intentional

204
00:17:41,040 --> 00:17:42,040
about it.

205
00:17:42,040 --> 00:17:48,520
But we, you know, we see that as a good thing that customers need to be, you know, intentional

206
00:17:48,520 --> 00:17:53,480
and planning out the scenarios that they want to cover.

207
00:17:53,480 --> 00:17:56,120
That's an interesting point about planning the scenarios.

208
00:17:56,120 --> 00:18:01,960
I imagine in many organizations, people are not necessarily experts at chaos engineering.

209
00:18:01,960 --> 00:18:05,680
So if I was given a scenario, I don't know, some environment, let's just make it up.

210
00:18:05,680 --> 00:18:11,320
You know, it's a browser talking to, you know, an Azure app of some kind, say, an Azure

211
00:18:11,320 --> 00:18:18,400
Function that then talks to Azure SQL Database and Redis cache and Key Vault.

212
00:18:18,400 --> 00:18:24,720
I mean, if I'm given an environment, I mean, I'm not necessarily going to know what things

213
00:18:24,720 --> 00:18:26,360
to do.

214
00:18:26,360 --> 00:18:30,320
Does the tool help, like come up with experiments?

215
00:18:30,320 --> 00:18:42,520
We have a new feature that just recently released around our GA timeframe and as part of our

216
00:18:42,520 --> 00:18:50,560
general availability called templates that provides rather than being dropped into just

217
00:18:50,560 --> 00:19:00,240
a blank chaos experiment with no faults or actions kind of pre-populated.

218
00:19:00,240 --> 00:19:07,000
We're giving a little, you know, a little quick start for customers to jump into certain

219
00:19:07,000 --> 00:19:09,200
common scenarios.

220
00:19:09,200 --> 00:19:16,800
The two that we have within the templates interface right now are an Azure Active Directory

221
00:19:16,800 --> 00:19:24,720
outage for virtual machines and virtual machine scale sets, and Availability Zone Down, where

222
00:19:24,720 --> 00:19:30,440
we abruptly shut down VM scale sets within a certain availability zone.

223
00:19:30,440 --> 00:19:32,200
So that helps a little bit.

224
00:19:32,200 --> 00:19:37,800
We definitely are looking for, you know, now that we're GA, we will be ramping up, you

225
00:19:37,800 --> 00:19:43,840
know, the amount of samples that we provide for various, you know, various use cases and

226
00:19:43,840 --> 00:19:49,680
configurations, building out that template library and of course, you know, adding more

227
00:19:49,680 --> 00:19:52,120
faults in general to our library.

228
00:19:52,120 --> 00:19:58,800
Of course, you know, over the long term, we will look into additional ways to, you know,

229
00:19:58,800 --> 00:20:05,560
provide more intelligent recommendations on what sort of scenarios to run, what sort of

230
00:20:05,560 --> 00:20:12,800
experiments to run and as well as, you know, other integrations across Azure.

231
00:20:12,800 --> 00:20:15,920
Can you, I mean, when you said there's an outage, can you like blip something so it

232
00:20:15,920 --> 00:20:19,640
just blips for a second or like goes offline for a split second and then comes back or

233
00:20:19,640 --> 00:20:23,400
is it really a lengthy bit of downtime?

234
00:20:23,400 --> 00:20:30,320
It really depends on the fault and kind of the, you know, the scenario to cover.

235
00:20:30,320 --> 00:20:39,880
You can perform shorter duration network, like network faults, whether that's kind of

236
00:20:39,880 --> 00:20:46,200
disconnecting certain traffic or introducing packet loss and latency.

237
00:20:46,200 --> 00:20:54,400
We're looking into sort of other possible blips and pauses, but I think the network

238
00:20:54,400 --> 00:21:00,520
latency, disconnect, packet loss, that's probably all the, you know, the closest we can get.

239
00:21:00,520 --> 00:21:08,140
We also have one common scenario that we recommend to customers is using our network security

240
00:21:08,140 --> 00:21:16,840
group rules fault to affect a broader range of services than, you know, than we have explicit

241
00:21:16,840 --> 00:21:17,980
faults for.

242
00:21:17,980 --> 00:21:35,660
So some people, you know, they'll come to Chaos Studio and see, you know, okay, you don't have any faults listed for, say, entirely disconnecting my Cosmos DB instance, or you don't have any faults listed for SQL at this time.

243
00:21:35,660 --> 00:22:01,840
And so what we can recommend to them is we have a fault that can create some network security group rules for a short time or for a specified time and do things like, okay, I want to block all of the traffic to a certain Azure service.

244
00:22:01,840 --> 00:22:08,360
And it supports Azure service tags. So you can use those service tags to

245
00:22:08,360 --> 00:22:15,480
say, and I don't know if I'll remember the tag correctly, but you could say, you know,

246
00:22:15,480 --> 00:22:25,120
AzureCosmosDB.EastUS and, you know, handily all of the IPs associated with Cosmos

247
00:22:25,120 --> 00:22:28,280
DB in East US are covered by that service tag.

248
00:22:28,280 --> 00:22:32,800
You can pretty easily block all that traffic. So that's another method

249
00:22:32,800 --> 00:22:36,000
that we often recommend to customers.
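Editor's sketch of what a temporary "block traffic to a service tag" rule might look like as data. The `deny_outbound_rule` helper and the rule name are hypothetical, and the property names are an assumption modeled on the general shape of an Azure network security rule rather than on the parameters of the Chaos Studio fault itself. The tag value `AzureCosmosDB.EastUS` is the Cosmos DB East US tag the speaker is recalling, and he notes he may not have it exactly right.

```python
import json


def deny_outbound_rule(service_tag: str, priority: int = 100) -> dict:
    """Build a rule that denies all outbound traffic to one service tag."""
    if not 100 <= priority <= 4096:
        raise ValueError("priority is assumed to be in the range 100-4096")
    return {
        # Hypothetical naming scheme so the temporary rule is easy to spot and remove.
        "name": "chaos-deny-" + service_tag.replace(".", "-").lower(),
        "properties": {
            "access": "Deny",
            "direction": "Outbound",
            "priority": priority,
            "protocol": "*",
            "sourceAddressPrefix": "*",
            "sourcePortRange": "*",
            # The service tag stands in for every IP range the service uses in that region.
            "destinationAddressPrefix": service_tag,
            "destinationPortRange": "*",
        },
    }


rule = deny_outbound_rule("AzureCosmosDB.EastUS")
print(json.dumps(rule, indent=2))
```

The point of the tag is that nobody has to list individual IP addresses; one rule covers the whole service in that region for as long as the fault runs.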

250
00:22:36,000 --> 00:22:39,440
Do you have any details about, like, what the most common little things people do are?

251
00:22:39,440 --> 00:22:44,880
Like, is there a top couple of things that everyone does as an experiment or part of an

252
00:22:44,880 --> 00:22:45,880
experiment?

253
00:22:45,880 --> 00:22:47,320
Yeah, that's a good question.

254
00:22:47,320 --> 00:23:11,280
I would say our most common scenarios that we see are virtual machine based. So whether it's shutting down virtual machines and virtual machine scale sets or using those agent faults that I mentioned on virtual machines, those are really common.

255
00:23:11,280 --> 00:23:18,240
And then the other scenario that is quite popular is using our integration with

256
00:23:18,240 --> 00:23:30,840
AKS Chaos Mesh. So Chaos Mesh is an open source framework for Kubernetes chaos engineering.

257
00:23:30,840 --> 00:23:45,680
And it provides faults like network, you know, network disruption, pod kill and pod disruption, various stress faults, HTTP, you know, all the good stuff.

258
00:23:45,680 --> 00:23:56,200
And rather than reinvent the wheel, we built a way that customers can, you know, start Chaos Mesh faults from Chaos Studio.

259
00:23:56,200 --> 00:24:02,120
And we have some tutorials on how to do this that, you know, I can share in the show notes.

260
00:24:02,120 --> 00:24:04,320
But that's another popular scenario.

261
00:24:04,320 --> 00:24:10,920
Kubernetes is obviously a very common part of many applications' infrastructure.

262
00:24:10,920 --> 00:24:17,720
So having some integration there with Chaos Mesh has been important for many customers.

263
00:24:17,720 --> 00:24:24,960
You know, it's interesting, I had a customer some years ago and they had a big web presence, retail presence.

264
00:24:24,960 --> 00:24:28,000
And by the way, I'm going somewhere with this story.

265
00:24:28,000 --> 00:24:33,080
And they found that their usage for Azure Key Vault, their bill, was actually pretty high and they couldn't work out why.

266
00:24:33,080 --> 00:24:41,680
Well, the reason was every time they made a connection or someone made a connection to their website, they would go and hit up Key Vault to pull some data down.

267
00:24:41,680 --> 00:24:48,040
The problem with that, of course, is not only, you know, first of all, Key Vault is not really a transactional service at all.

268
00:24:48,040 --> 00:24:51,520
You know, you're not supposed to be hitting it thousands of times a second, which is what they were doing.

269
00:24:51,520 --> 00:24:57,200
And then you end up getting timeouts, which Key Vault does by default.

270
00:24:57,200 --> 00:25:05,200
And so what they ended up doing was caching information for 30 minutes, so basically they're hitting Key Vault every 30 minutes asynchronously.

271
00:25:05,200 --> 00:25:12,480
And so not only did their Key Vault cost go down, their performance went up, but also their reliability went up, right?

272
00:25:12,480 --> 00:25:17,760
Because they weren't so dependent on Key Vault being there thousands of times a second.
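Editor's sketch of the caching idea in the story. This is plain Python with a stand-in fetch function, not a Key Vault client: the `CachedSecret` class, `fake_fetch`, and the injected clock are assumptions for illustration. It also simplifies the story by refreshing lazily on read rather than asynchronously in the background.

```python
import time
from typing import Callable


class CachedSecret:
    """Serve a secret from memory and only re-fetch it once the TTL has passed."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float = 30 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = ""
        self._fetched_at = 0.0
        self._loaded = False

    def get(self) -> str:
        now = self._clock()
        if not self._loaded or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()   # the only place the remote store is contacted
            self._fetched_at = now
            self._loaded = True
        return self._value


# Fake store and fake clock so the behaviour can be checked without waiting 30 minutes.
calls = []

def fake_fetch() -> str:
    calls.append(1)
    return f"secret-v{len(calls)}"

now = [0.0]
secret = CachedSecret(fake_fetch, ttl_seconds=1800, clock=lambda: now[0])

first = secret.get()
for _ in range(1000):          # a burst of "requests" inside the 30-minute window
    secret.get()
fetches_during_burst = len(calls)

now[0] = 1801.0                # just past the TTL
refreshed = secret.get()

print(first, fetches_during_burst, refreshed, len(calls))  # secret-v1 1 secret-v2 2
```

A thousand reads cost one fetch in this model, which is the cost, performance, and reliability effect described above: the application no longer depends on the remote store for every request.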

273
00:25:17,760 --> 00:25:24,080
This wasn't found through anything other than just someone saying, why do you do this?

274
00:25:24,080 --> 00:25:25,880
But I'm sure you see things like that, right?

275
00:25:25,880 --> 00:25:33,480
Where people are just, you know, they run Chaos Studio and they say, wow, you know, why does our application go down because of that one thing?

276
00:25:33,480 --> 00:25:38,600
You know, that scenario happened and they end up changing their design.

277
00:25:38,600 --> 00:25:41,680
So this is the question coming out of all this.

278
00:25:41,680 --> 00:25:50,360
So, I mean, what sort of changes do you see people make to their designs to make them more robust in the face of Chaos Studio or, you know, intermittent outages?

279
00:25:50,360 --> 00:25:56,760
So I think actually Key Vault is an example that I would mention here, too.

280
00:25:56,760 --> 00:26:09,800
We had some internal teams, one of the case studies listed actually on our product page on azure.microsoft.com.

281
00:26:09,800 --> 00:26:25,120
We have a video that talks through some case studies and one of those case studies is from an internal team who has done testing with some of these Key Vault faults.

282
00:26:25,120 --> 00:26:32,400
They also made an appearance alongside us at some conferences last year.

283
00:26:32,400 --> 00:26:37,040
So I can share some of the links to that in the notes as well.

284
00:26:37,040 --> 00:26:44,880
So I don't recall exact details on kind of the, you know, the changes that they made to their infrastructure.

285
00:26:44,880 --> 00:26:57,840
But they found issues relating to, you know, Key Vault and how they were treating certain failure scenarios and were able to get those remedied.

286
00:26:57,840 --> 00:27:10,280
The other scenario I believe we see is not handling virtual machine outages, you know, as expected.

287
00:27:10,280 --> 00:27:20,040
Basically, you know, the availability zone scenario has been quite important, testing what happens when all of the virtual machines,

288
00:27:20,040 --> 00:27:28,440
virtual machine scale sets in an availability zone are out and abruptly shut down.

289
00:27:28,440 --> 00:27:34,960
We've had some internal teams go through that testing and see issues with their infrastructure,

290
00:27:34,960 --> 00:27:42,200
you know, handling that sort of case and have been able to since fix them.

291
00:27:42,200 --> 00:27:45,720
I think those are the main examples that I would draw on there.

292
00:27:45,720 --> 00:27:56,800
But, you know, it really varies since it's quite dependent on infrastructure and how your workload is set up and all that good stuff.

293
00:27:56,800 --> 00:28:00,720
So you mentioned earlier about essentially authorization policies.

294
00:28:00,720 --> 00:28:07,720
You know, you don't want every Tom, Dick and Harry just running amok inside of people's subscriptions using Chaos Studio.

295
00:28:07,720 --> 00:28:14,880
So at what level do you restrict who can do what using Chaos Studio?

296
00:28:14,880 --> 00:28:24,480
Yeah, that's a great question. So there are two aspects to this. One is permissions to use Chaos Studio.

297
00:28:24,480 --> 00:28:39,080
So we have role-based access control policies for being able to create chaos experiments, onboard resources as targets, and start chaos experiments.

298
00:28:39,080 --> 00:28:52,600
So you can control whether you want these people or these identities in your organization to be able to work with Azure Chaos Studio at all.

299
00:28:52,600 --> 00:29:00,480
The other component then is actually executing faults against resources.

300
00:29:00,480 --> 00:29:14,520
And so we have an identity, either a system-assigned managed identity or a user-assigned managed identity, that is attached to the chaos experiment.

301
00:29:14,520 --> 00:29:22,440
And that identity needs to have the proper permissions for each individual resource that we're targeting.

302
00:29:22,440 --> 00:29:36,320
So if it's a virtual machine, that means Virtual Machine Contributor access. Those roles are all listed on our product page and can be automatically assigned to the identity.

303
00:29:36,320 --> 00:29:42,400
That is, assuming that you have the permissions and if you so desire.

304
00:29:42,400 --> 00:30:00,160
So that's our role-based access control model, restricting both sides: who can use Chaos Studio, but also which resources can actually be affected.

305
00:30:00,160 --> 00:30:10,280
So on the flip side of that, does Chaos Studio work well in isolated environments with, say, private endpoints, VNet injection, that kind of stuff?

306
00:30:10,280 --> 00:30:17,640
Yeah, so we have had some recent feature additions in this area.

307
00:30:17,640 --> 00:30:23,760
You can use some of our fault capabilities with private networking.

308
00:30:23,760 --> 00:30:35,160
So, for example, you can perform Chaos Mesh faults against an AKS cluster that is private.

309
00:30:35,160 --> 00:30:39,560
And we have a tutorial on how to do that in our documentation.

310
00:30:39,560 --> 00:30:46,400
And then we just recently added Private Link support for agent scenarios as well.

311
00:30:46,400 --> 00:31:03,520
So that allows our chaos agent to talk back to the experiment and the orchestration infrastructure while still staying secure within Private Link.

312
00:31:03,520 --> 00:31:06,720
All right. I think it's probably time to bring this episode to an end.

313
00:31:06,720 --> 00:31:13,960
So, Rigel, one question we always ask our guests is if you had just one final thought to leave our listeners with, what would it be?

314
00:31:13,960 --> 00:31:26,840
My one final thought, I think, is to think about how your systems fail, these complex systems that we work with in the cloud.

315
00:31:26,840 --> 00:31:33,240
Do you know what emergent behavior you might see when unexpected outages happen?

316
00:31:33,240 --> 00:31:35,680
And then how can you go and test it?

317
00:31:35,680 --> 00:31:38,560
And test it using Chaos Studio, right?

318
00:31:38,560 --> 00:31:42,480
Ideally, but, you know, we're all for chaos engineering as a discipline.

319
00:31:42,480 --> 00:31:44,200
We would love it.

320
00:31:44,200 --> 00:31:46,160
You know, if you came to use us.

321
00:31:46,160 --> 00:31:53,320
Yeah, if we could add some links to the show notes about chaos engineering just in general, I think that's a really interesting area.

322
00:31:53,320 --> 00:31:55,200
So, yeah, we'll definitely go ahead and do that.

323
00:31:55,200 --> 00:31:57,640
Hey, look, Rigel, thank you so much for joining us this week.

324
00:31:57,640 --> 00:32:01,640
I always learn something on these podcast episodes, and this is absolutely no exception.

325
00:32:01,640 --> 00:32:10,880
And I'm a huge fan of Chaos Studio specifically, and chaos engineering just in general, because sometimes things don't go the way you expect.

326
00:32:10,880 --> 00:32:16,240
And it's always good to at least have a better idea of what might happen if things do go awry.

327
00:32:16,240 --> 00:32:20,040
And to all our listeners out there, thank you so much for joining us this week.

328
00:32:20,040 --> 00:32:21,600
We hope you found this episode of use.

329
00:32:21,600 --> 00:32:23,800
Stay safe and we'll see you next time.

330
00:32:23,800 --> 00:32:26,680
Thanks for listening to the Azure Security Podcast.

331
00:32:26,680 --> 00:32:33,480
You can find show notes and other resources at our Web site, azsecuritypodcast.net.

332
00:32:33,480 --> 00:32:38,240
If you have any questions, please find us on Twitter at @AzureSecPod.

333
00:32:38,240 --> 00:32:57,240
Background music is from ccmixter.com and licensed under the Creative Commons license.

