1
00:00:00,000 --> 00:00:10,000
Welcome to Artificially Intelligent Marketing, a weekly podcast where we stay on top of the

2
00:00:10,000 --> 00:00:15,760
latest trends, tips and tools in the world of marketing AI, helping you get the best

3
00:00:15,760 --> 00:00:18,560
results from your marketing efforts.

4
00:00:18,560 --> 00:00:22,960
Now let's join our hosts, Paul Avery and Martin Broadhurst.

5
00:00:22,960 --> 00:00:24,320
Hello there, folks.

6
00:00:24,320 --> 00:00:27,540
Welcome to episode 42 of Artificially Intelligent Marketing.

7
00:00:27,540 --> 00:00:32,280
We've got a cracker this week, and as usual, I'm joined by my very good friend Martin Broadhurst

8
00:00:32,280 --> 00:00:35,040
who's going to take us through the AI news this week.

9
00:00:35,040 --> 00:00:37,040
How are you, Martin?

10
00:00:37,040 --> 00:00:38,640
Yeah, splendid.

11
00:00:38,640 --> 00:00:42,960
Glad to see your face staring back at me across the interwebs.

12
00:00:42,960 --> 00:00:48,280
Well, I'm sorry that you had to put up with that, but hopefully I've got some interesting

13
00:00:48,280 --> 00:00:51,960
things to say about the AIs, which will be interesting for you and hopefully interesting

14
00:00:51,960 --> 00:00:54,160
for our listeners as well.

15
00:00:54,160 --> 00:00:55,160
We've got so much to cover.

16
00:00:55,160 --> 00:00:57,560
We're going to have to just jump straight in, Martin.

17
00:00:57,560 --> 00:01:00,440
Get us started with Claude 3.

18
00:01:00,440 --> 00:01:06,960
The big news of the week, we have an official GPT-4 killer, at least according to the benchmarks

19
00:01:06,960 --> 00:01:08,320
anyway.

20
00:01:08,320 --> 00:01:14,080
Anthropic have released the latest family of the Claude model, so it comes in three

21
00:01:14,080 --> 00:01:15,080
flavours.

22
00:01:15,080 --> 00:01:23,200
We've got Haiku, the smallest, Sonnet, the mid-range model, and Opus, the largest model

23
00:01:23,200 --> 00:01:33,280
that is the GPT-4 and Google Gemini Ultra class of model, or, well, we don't know how it compares against

24
00:01:33,280 --> 00:01:36,080
Google Gemini Pro 1.5.

25
00:01:36,080 --> 00:01:39,480
Yeah, God, these names, Paul, these names.

26
00:01:39,480 --> 00:01:40,480
They're killing me.

27
00:01:40,480 --> 00:01:42,440
It don't work.

28
00:01:42,440 --> 00:01:47,240
But they are very good, very fast, very intelligent.

29
00:01:47,240 --> 00:01:51,000
So what do we get actually under the hood and what are they saying about them?

30
00:01:51,000 --> 00:01:58,680
So they've got new improved capabilities across understanding, fluency, complex task management.

31
00:01:58,680 --> 00:02:07,740
All models, so all three sizes, have vision capabilities, which is pretty cool.

32
00:02:07,740 --> 00:02:14,160
They've also talked about potential new features such as tool use, also called function calling

33
00:02:14,160 --> 00:02:20,320
on some of the other models, as well as interactive coding, so where the model can actually read

34
00:02:20,320 --> 00:02:27,640
and run bits of script and what they've called, and this is the first time I've heard this,

35
00:02:27,640 --> 00:02:36,680
advanced agentic capabilities, basically helping it to become an AI agent, which as we know

36
00:02:36,680 --> 00:02:40,060
is the future of this kind of technology.

37
00:02:40,060 --> 00:02:46,160
In terms of the actual model itself and interacting with the models, some changes from Claude

38
00:02:46,160 --> 00:02:53,160
2.1, they say that it has fewer refusals.

39
00:02:53,160 --> 00:02:58,800
So what you used to get because of the guardrails that they put around Claude 2.1 is you might

40
00:02:58,800 --> 00:03:06,880
ask it to write you a script where the main protagonist of the story is a cyber crime

41
00:03:06,880 --> 00:03:14,000
hacker, and it would say, I can't write about things that are illegal or immoral.

42
00:03:14,000 --> 00:03:19,380
Whereas now the model recognizes that you're not asking it to hack somebody's system or

43
00:03:19,380 --> 00:03:24,720
give you a how-to guide on breaking into the Pentagon, you're asking for a story.

44
00:03:24,720 --> 00:03:28,760
So it recognizes that and it will actually assist you now.

45
00:03:28,760 --> 00:03:34,760
So far fewer of those refusals that you would get on the old model.

46
00:03:34,760 --> 00:03:43,880
It has increased accuracy and a better citation feature, which is coming in as well, and they've

47
00:03:43,880 --> 00:03:48,920
improved its ability to recognize when it doesn't know an answer.

48
00:03:48,920 --> 00:03:54,840
This is very much when you look at the benchmark tests for this, this is very much on the Opus

49
00:03:54,840 --> 00:03:56,640
model.

50
00:03:56,640 --> 00:04:05,480
Sonnet compares pretty much like Claude 2.1, not a huge amount in it. In terms of that test, they

51
00:04:05,480 --> 00:04:06,800
basically have three scores.

52
00:04:06,800 --> 00:04:07,800
Does it give a correct answer?

53
00:04:07,800 --> 00:04:09,840
Does it give an incorrect answer?

54
00:04:09,840 --> 00:04:13,560
And if it doesn't know, does it say it doesn't know?

55
00:04:13,560 --> 00:04:20,760
Then the Opus model is the best across the board, with the lowest number of incorrect answers.

56
00:04:20,760 --> 00:04:23,600
The number of correct answers is by far the highest.

57
00:04:23,600 --> 00:04:30,440
And it does say it doesn't know more frequently rather than just hallucinating the response.

58
00:04:30,440 --> 00:04:32,240
It also has advanced recall capabilities.

59
00:04:32,240 --> 00:04:38,400
So they've done a needle-in-a-haystack test where you take a large amount of text,

60
00:04:38,400 --> 00:04:46,000
use that large context window, insert a random snippet of text and then ask the model to find

61
00:04:46,000 --> 00:04:49,960
that and they put it at the beginning, the middle and the end to see how it does in terms

62
00:04:49,960 --> 00:04:53,480
of recalling this piece of information.

63
00:04:53,480 --> 00:04:57,800
And this test threw up something interesting that's not been seen on any other models where

64
00:04:57,800 --> 00:05:03,760
it identified the correct piece of information and then it responded by saying, this sentence

65
00:05:03,760 --> 00:05:08,920
seems very out of place and unrelated to the rest of the content in the documents, which

66
00:05:08,920 --> 00:05:13,800
are about programming languages, startups and finding work you love.

67
00:05:13,800 --> 00:05:19,260
I suspect this pizza topping fact may have been inserted as a joke or to test if I was paying

68
00:05:19,260 --> 00:05:24,840
attention since it does not fit with the other topics at all.

69
00:05:24,840 --> 00:05:30,880
So it's got an awareness that the user is trying to trick it.

70
00:05:30,880 --> 00:05:35,800
This is my favorite part of this because we're really excited about getting our hands on Gemini

71
00:05:35,800 --> 00:05:44,720
1.5 Pro because of its 200k to maybe 1 million token context.

72
00:05:44,720 --> 00:05:49,200
So you can give it lots of information, but it's recall ability that excites me the most.

73
00:05:49,200 --> 00:05:54,820
Claude 2.1 was still our go-to, wasn't it, Martin, for summarizing long documents, call transcripts,

74
00:05:54,820 --> 00:05:56,160
just because of its accuracy.

75
00:05:56,160 --> 00:06:02,560
And what this would make it sound like is the new Claude Opus model is even more accurate

76
00:06:02,560 --> 00:06:03,560
than that.

77
00:06:03,560 --> 00:06:10,400
So given its context window, it's like getting access to Gemini 1.5 Pro's recall capabilities,

78
00:06:10,400 --> 00:06:12,160
but without having to wait for that to be released.

79
00:06:12,160 --> 00:06:16,240
So that's the thing I'm most excited about testing with this model is summarizing long

80
00:06:16,240 --> 00:06:18,740
documents, call transcripts and stuff like that.

81
00:06:18,740 --> 00:06:22,280
They go on in the paper to talk about some of the other capabilities of the individual

82
00:06:22,280 --> 00:06:25,600
models themselves about Claude 3 Opus.

83
00:06:25,600 --> 00:06:30,640
They say it's the most intelligent model and they explicitly say best in market performance

84
00:06:30,640 --> 00:06:32,900
on complex tasks.

85
00:06:32,900 --> 00:06:39,400
And it says Opus shows us the outer limits of what's possible with generative AI.

86
00:06:39,400 --> 00:06:43,560
So it can navigate open-ended prompts and sight-unseen scenarios.

87
00:06:43,560 --> 00:06:50,000
So things where it's never been shown this situation before, and it can rationalize and

88
00:06:50,000 --> 00:06:54,060
navigate its way through with human-like understanding.

89
00:06:54,060 --> 00:06:59,960
They talk about it being good for task automation, such as planning and executing complex actions

90
00:06:59,960 --> 00:07:07,360
across APIs and databases, for research reviews, brainstorming, hypothesis generation, drug

91
00:07:07,360 --> 00:07:10,600
discovery is mentioned in there as well.

92
00:07:10,600 --> 00:07:15,320
They talk specifically about strategy and advanced analysis of charts and graphs, financials

93
00:07:15,320 --> 00:07:17,240
and market trends.

94
00:07:17,240 --> 00:07:22,440
And the differentiator, they say, is higher intelligence than any other model available.

95
00:07:22,440 --> 00:07:25,720
They're very explicit about that.

96
00:07:25,720 --> 00:07:26,840
Yeah, I'm excited.

97
00:07:26,840 --> 00:07:31,440
I think with all these things, its performance in the benchmarks is impressive.

98
00:07:31,440 --> 00:07:36,160
It basically beats all the other models, including Gemini Ultra, but it's all about getting it

99
00:07:36,160 --> 00:07:39,560
in your hands and finding those nuances of where it's really good.

100
00:07:39,560 --> 00:07:45,280
For example, there are some use cases where I use Gemini Pro the original, which is the

101
00:07:45,280 --> 00:07:48,400
GPT 3.5 level model.

102
00:07:48,400 --> 00:07:51,000
And like its copywriting ability is pretty good.

103
00:07:51,000 --> 00:07:56,760
It's certainly better than GPT 3.5, maybe not quite as good as GPT 4, but it's a strong

104
00:07:56,760 --> 00:07:59,200
model that's really cheap to use.

105
00:07:59,200 --> 00:08:04,920
So I think an interesting thing will be how do we stretch Claude 3 Opus and see what it

106
00:08:04,920 --> 00:08:05,920
can really do?

107
00:08:05,920 --> 00:08:10,040
I want to give it a brief about a piece of content I want to create, and then I want

108
00:08:10,040 --> 00:08:14,840
it to guide itself with me nudging it through the outline process to draft without me having

109
00:08:14,840 --> 00:08:18,680
to do so much directing and just see what the quality of output that I can get, which

110
00:08:18,680 --> 00:08:23,800
is a test I run on most of the models where in general, I'm not always pleased with the

111
00:08:23,800 --> 00:08:27,040
outputs at every stage and I have to edit them quite heavily.

112
00:08:27,040 --> 00:08:29,160
Let's see what Claude 3 Opus can do.

113
00:08:29,160 --> 00:08:33,600
But yeah, that and feeding a bunch of data and seeing how good it is at data analysis,

114
00:08:33,600 --> 00:08:38,880
because having a data analyst in the team is always useful, especially in the modern

115
00:08:38,880 --> 00:08:40,840
world where we collect so much data.

116
00:08:40,840 --> 00:08:42,560
But not everybody's good at data analysis.

117
00:08:42,560 --> 00:08:47,200
So having a tool like this that can guide you through it, that sounds quite code interpreter

118
00:08:47,200 --> 00:08:48,920
like. It's just, is it better?

119
00:08:48,920 --> 00:08:51,000
It's going to be the question, I think.

120
00:08:51,000 --> 00:08:56,560
In the first test that I've done with it, because I've got a live project at the moment,

121
00:08:56,560 --> 00:09:01,920
which requires analysis of lots of text based inputs.

122
00:09:01,920 --> 00:09:09,320
Each record is somebody documenting organizations they've worked with, people they've worked

123
00:09:09,320 --> 00:09:15,040
with, what the outcome of an activity was, whether it instigated any change, whether

124
00:09:15,040 --> 00:09:19,160
there were any hindrances to them achieving what they were hoping to achieve.

125
00:09:19,160 --> 00:09:26,480
I've been running this analysis through Claude 2.1 and getting slightly frustrated when working

126
00:09:26,480 --> 00:09:33,200
with the client, we're kind of happy with what it says, but it lacks detail and specificity.

127
00:09:33,200 --> 00:09:38,680
Well yesterday when they announced Claude 3, I did a raw dump of the data.

128
00:09:38,680 --> 00:09:40,120
There was no real finessing of it.

129
00:09:40,120 --> 00:09:48,960
I took 75 records in total, put them in and asked for a summary of the output.

130
00:09:48,960 --> 00:09:53,620
And the detail was significantly higher.

131
00:09:53,620 --> 00:09:57,280
And without doubt, the analysis was better.

132
00:09:57,280 --> 00:10:03,040
The ability to cross reference between projects and find commonalities and also that kind

133
00:10:03,040 --> 00:10:10,000
of compare and contrast element was just so much stronger than what I'd seen with GPT-4

134
00:10:10,000 --> 00:10:12,800
or with Claude 2.1.

135
00:10:12,800 --> 00:10:15,400
So I'm excited for this.

136
00:10:15,400 --> 00:10:20,480
We talked on the WhatsApps, didn't we, about market research data, customer interviews

137
00:10:20,480 --> 00:10:26,020
and being able to feed it a load of that data and asking it to pull out the trends and hopefully

138
00:10:26,020 --> 00:10:31,200
being able to trust in its ability to recall everything it's been given, which the data

139
00:10:31,200 --> 00:10:37,280
suggests it's much better at, but also find the nuances and the trends in a way that,

140
00:10:37,280 --> 00:10:40,960
I mean ideally in a way that a human might not even see.

141
00:10:40,960 --> 00:10:43,760
But certainly at that level, because I think you're right in some of the tests that we've

142
00:10:43,760 --> 00:10:55,720
run, its tendency to just ignore fairly sizable bits of data or particular comments from customers

143
00:10:55,720 --> 00:10:56,720
makes you not trust it.

144
00:10:56,720 --> 00:11:00,960
Because you know if a human read that transcript they'd go, did you see that statement?

145
00:11:00,960 --> 00:11:04,520
That's massive and the AI has just completely ignored it.

146
00:11:04,520 --> 00:11:09,480
And again, until we have that level of power, I think you might use the AI to speed up your

147
00:11:09,480 --> 00:11:13,440
analysis or to double check your human analysis, but you wouldn't trust it instead of a human

148
00:11:13,440 --> 00:11:14,440
at this point.

149
00:11:14,440 --> 00:11:18,200
And that's something else I'm quite excited to test Opus on.

150
00:11:18,200 --> 00:11:22,280
But it's interesting because even Sonnet is pretty handy, right?

151
00:11:22,280 --> 00:11:25,560
You said it's probably Claude 2.1 level.

152
00:11:25,560 --> 00:11:31,840
You've got to pay to access Opus. I think it's 18 British pounds a month to get a Pro

153
00:11:31,840 --> 00:11:38,000
account with Anthropic, but you can still access Claude 3 Sonnet, which is 2.1 level

154
00:11:38,000 --> 00:11:39,800
for free.

155
00:11:39,800 --> 00:11:42,040
And to all intents and purposes, it's still pretty good.

156
00:11:42,040 --> 00:11:46,440
And there's a number of things they recommend that you can trust Claude 3 Sonnet with, isn't

157
00:11:46,440 --> 00:11:47,440
there, Martin?

158
00:11:47,440 --> 00:11:50,680
Yeah, so it's actually Claude 2.1 beating.

159
00:11:50,680 --> 00:11:53,440
It is a higher level model.

160
00:11:53,440 --> 00:11:59,820
They give potential use cases as being data processing with RAG capabilities or search

161
00:11:59,820 --> 00:12:05,320
and retrieval over vast amounts of knowledge, product recommendations, forecasting, targeted

162
00:12:05,320 --> 00:12:11,440
marketing, time-saving tasks such as code generation, quality control, parsing text

163
00:12:11,440 --> 00:12:17,400
from images, those kinds of easy, well, I say easy tasks, those kinds of basic tasks.

164
00:12:17,400 --> 00:12:22,400
And then they say it's more affordable than the other models with similar intelligence,

165
00:12:22,400 --> 00:12:24,560
much better for scale.

166
00:12:24,560 --> 00:12:28,160
And when you start playing with Sonnet, you do get a feel for, oh, this is actually a

167
00:12:28,160 --> 00:12:29,520
very capable model.

168
00:12:29,520 --> 00:12:33,280
This doesn't feel like an also-ran in the lineup.

169
00:12:33,280 --> 00:12:40,360
And actually just to take that even further, I was doing a comparison of Haiku, the smallest

170
00:12:40,360 --> 00:12:49,680
model versus the, well, comparable models, which would be GPT 3.5 Turbo.

171
00:12:49,680 --> 00:12:54,120
And it beats 3.5 Turbo on all the benchmarks.

172
00:12:54,120 --> 00:13:00,280
It beats Gemini Pro on all of the benchmarks.

173
00:13:00,280 --> 00:13:06,960
And its cost is cheaper than all of them as well.

174
00:13:06,960 --> 00:13:07,960
That's impressive.

175
00:13:07,960 --> 00:13:11,800
I mean, there's a couple of trends that are coming out of this.

176
00:13:11,800 --> 00:13:17,520
Firstly, we're now seeing the launch of multiple models at different levels in one go from

177
00:13:17,520 --> 00:13:18,960
those providers, right?

178
00:13:18,960 --> 00:13:23,000
Gemini Nano, which I've never heard of again since the launch.

179
00:13:23,000 --> 00:13:24,060
Maybe they've forgotten about it.

180
00:13:24,060 --> 00:13:25,920
Maybe they're going to be baking it into mobile devices.

181
00:13:25,920 --> 00:13:27,000
Who knows?

182
00:13:27,000 --> 00:13:28,920
But Nano, Pro and Ultra.

183
00:13:28,920 --> 00:13:32,880
Now we've got Haiku, Sonnet and Opus.

184
00:13:32,880 --> 00:13:34,880
There's Mistral Medium, Mistral Large.

185
00:13:34,880 --> 00:13:37,760
We'll talk a little bit about that in a moment.

186
00:13:37,760 --> 00:13:43,680
So there is a tendency now towards different size models for different things that have

187
00:13:43,680 --> 00:13:48,020
pros and cons mostly related to cost and speed, as far as I can tell.

188
00:13:48,020 --> 00:13:53,320
We spoke previously on the podcast about why would I ever use GPT 3.5 if I've got access

189
00:13:53,320 --> 00:13:55,040
to GPT 4.

190
00:13:55,040 --> 00:14:00,180
And I think that for a chat interface where you're asking questions of it, like most of

191
00:14:00,180 --> 00:14:04,240
us have got used to doing with ChatGPT, I still think that's a fair criticism, to be

192
00:14:04,240 --> 00:14:07,000
honest, until you max out your usage, let's say.

193
00:14:07,000 --> 00:14:12,040
But where I really see these multiple models having use is when AI is almost running in

194
00:14:12,040 --> 00:14:14,080
the background, right?

195
00:14:14,080 --> 00:14:17,600
You've got a lead, it comes through your website.

196
00:14:17,600 --> 00:14:21,920
A Haiku level model goes, scrapes a load of information off the internet, tries to understand

197
00:14:21,920 --> 00:14:25,600
what it means, adds it to the relevant places in the contact record.

198
00:14:25,600 --> 00:14:30,180
Because do you need the absolute super duper compute power of an Opus for that?

199
00:14:30,180 --> 00:14:31,180
Probably not.

200
00:14:31,180 --> 00:14:36,720
So I've changed my mind a little bit in terms of, yes, when I'm manually interacting with

201
00:14:36,720 --> 00:14:39,600
a model, depending on what I'm trying to do, I'm probably going to try and use the best

202
00:14:39,600 --> 00:14:42,960
model I can, as long as it's not prohibitively expensive for me.

203
00:14:42,960 --> 00:14:47,040
But I do think when AI gets baked into everything, it's where these other models will start to

204
00:14:47,040 --> 00:14:48,240
shine.

205
00:14:48,240 --> 00:14:53,040
We've also talked on the podcast about the importance of speed for real-time feedback,

206
00:14:53,040 --> 00:14:57,200
like some of these emerging customer service bots that we'll talk about later that are

207
00:14:57,200 --> 00:15:01,240
able to hold an audio conversation with a person.

208
00:15:01,240 --> 00:15:04,840
Well, they need low latency because if you have to wait for the other person to respond

209
00:15:04,840 --> 00:15:10,520
too long, it's a jarring experience, but it also just screams bot, right?

210
00:15:10,520 --> 00:15:12,720
So the speeds of the models will be important.

211
00:15:12,720 --> 00:15:15,600
Then finally, we've talked about onboard models.

212
00:15:15,600 --> 00:15:19,320
So you're on your mobile phone, you want to interact with the chat bot, you haven't got

213
00:15:19,320 --> 00:15:22,880
access to the internet, or you want to keep the data on your phone private, you need to

214
00:15:22,880 --> 00:15:24,160
be able to run a model locally.

215
00:15:24,160 --> 00:15:26,440
And this is again where those small models will come in.

216
00:15:26,440 --> 00:15:33,580
So Anthropic is clearly subscribing to this multiple-tier model point of view, and then we as

217
00:15:33,580 --> 00:15:39,520
users now have to try and get to grips with which model is going to be right for certain

218
00:15:39,520 --> 00:15:40,520
tasks.

219
00:15:40,520 --> 00:15:43,840
And I actually really appreciate the blog post that Anthropic wrote about this, where

220
00:15:43,840 --> 00:15:49,240
they gave us those use cases, because that's kind of been missing for me a bit in terms

221
00:15:49,240 --> 00:15:53,520
of when should you use each model other than just testing it yourself.

222
00:15:53,520 --> 00:15:55,120
It's nice to get some guidance.

223
00:15:55,120 --> 00:15:57,080
Yeah, very much so.

224
00:15:57,080 --> 00:16:01,360
There's some additional detail that's come out around Claude 3 that I think is worth

225
00:16:01,360 --> 00:16:02,680
touching on.

226
00:16:02,680 --> 00:16:09,560
And that's its high scores on what is called the GPQA benchmark.

227
00:16:09,560 --> 00:16:14,400
So the GPQA benchmark is the graduate-level Google-proof question and answer benchmark.

228
00:16:14,400 --> 00:16:21,040
And the idea is that there's a challenging data set of 448 multiple choice questions

229
00:16:21,040 --> 00:16:28,880
written by domain level experts, PhD level experts across biology, physics, and chemistry.

230
00:16:28,880 --> 00:16:31,680
And it tests factual knowledge and reasoning abilities.

231
00:16:31,680 --> 00:16:36,120
And the idea is that these questions are very hard for people to answer.

232
00:16:36,120 --> 00:16:42,120
And if you're not a domain level expert, they're hard for you to answer even if you have access

233
00:16:42,120 --> 00:16:45,640
to Google to try and answer them.

234
00:16:45,640 --> 00:16:54,320
So on this benchmark, when humans sit it, PhDs with access to the internet, they score 34% on

235
00:16:54,320 --> 00:16:56,720
questions outside of their domain.

236
00:16:56,720 --> 00:16:58,480
That goes to show how difficult it is.

237
00:16:58,480 --> 00:16:59,600
They get one in three.

238
00:16:59,600 --> 00:17:02,840
These are multiple choice questions as well, right?

239
00:17:02,840 --> 00:17:09,640
And then inside their domain, they score between 65 and 75%.

240
00:17:09,640 --> 00:17:18,360
Claude 3 scores 60% overall, which goes to show the...

241
00:17:18,360 --> 00:17:22,680
In fact, it reinforces a point that Sam Altman made in conversation with The Economist a

242
00:17:22,680 --> 00:17:29,520
few weeks ago, which is when these models improve, they improve in a very general way.

243
00:17:29,520 --> 00:17:32,000
It's not in one specific field.

244
00:17:32,000 --> 00:17:35,240
And that's the thing that's kind of mind blowing about it.

245
00:17:35,240 --> 00:17:41,080
This test, the GPQA, was only designed a few months ago.

246
00:17:41,080 --> 00:17:44,280
It was the back end of 2023 when this was proposed.

247
00:17:44,280 --> 00:17:49,680
The idea was that this was the kind of benchmark that was going to be really difficult for

248
00:17:49,680 --> 00:17:53,760
AIs to pass for a while yet.

249
00:17:53,760 --> 00:17:54,760
And here we are.

250
00:17:54,760 --> 00:17:56,120
Yeah, I love this.

251
00:17:56,120 --> 00:17:58,120
I love the benchmark.

252
00:17:58,120 --> 00:18:00,400
I love the result that Claude got.

253
00:18:00,400 --> 00:18:06,880
I love the fact that people are thinking about more serious ways to test the tools as they

254
00:18:06,880 --> 00:18:07,880
evolve.

255
00:18:07,880 --> 00:18:14,000
I mean, we've talked previously, I remember when Gemini was released in December and we

256
00:18:14,000 --> 00:18:17,040
looked at its performance on the benchmarks against GPT-4.

257
00:18:17,040 --> 00:18:21,800
And in a lot of areas, the scores are like 85%, 90%.

258
00:18:21,800 --> 00:18:25,520
And you start to have a series of tests that are no longer informative, right?

259
00:18:25,520 --> 00:18:28,880
What's the difference between going from 90 to 92%?

260
00:18:28,880 --> 00:18:30,400
Does that really mean that much?

261
00:18:30,400 --> 00:18:32,800
And it's going to depend on the test, right?

262
00:18:32,800 --> 00:18:38,080
So I love the fact that people created harder tests, but it's just how quickly AI is now

263
00:18:38,080 --> 00:18:40,120
able to do really well on those tests.

264
00:18:40,120 --> 00:18:45,240
And 60% gives us plenty of stretch room, right, to try and see if we can get it better.

265
00:18:45,240 --> 00:18:51,480
But the fact that it can be a domain, a near domain expert, I would say, in all domains

266
00:18:51,480 --> 00:18:53,280
is pretty impressive.

267
00:18:53,280 --> 00:18:57,680
The other thing is whenever I see PhDs, these are domain experts.

268
00:18:57,680 --> 00:19:01,000
So these are probably not PhD students.

269
00:19:01,000 --> 00:19:07,040
These are 20-year veterans in their fields that happen to have a PhD that they got at

270
00:19:07,040 --> 00:19:08,480
the beginning of their studies.

271
00:19:08,480 --> 00:19:12,600
So whilst PhD is a really helpful moniker here, I think it's more important to think about

272
00:19:12,600 --> 00:19:14,640
their domain expertise.

273
00:19:14,640 --> 00:19:21,600
And when I see that Claude 3 gets 60% and those seasoned experts get 65 to 75, I think Claude 3 is

274
00:19:21,600 --> 00:19:26,400
probably beating most PhD students coming towards the end of their studies, right?

275
00:19:26,400 --> 00:19:30,640
But you meet someone who's just finished a PhD, they are very knowledgeable about the

276
00:19:30,640 --> 00:19:32,760
topic that they're in.

277
00:19:32,760 --> 00:19:35,840
So those conversations with Claude 3 would also be fascinating.

278
00:19:35,840 --> 00:19:43,760
And one has to expect a response from our good friends over at OpenAI because they love

279
00:19:43,760 --> 00:19:48,800
to troll another company's important and exciting releases.

280
00:19:48,800 --> 00:19:55,200
So GPT-4.5 or GPT-5's release might have just been pushed forward a little bit, I would

281
00:19:55,200 --> 00:20:01,440
suspect if and when it's ready to go because it now looks like we've all got to get to

282
00:20:01,440 --> 00:20:05,440
grips with playing with Claude 3 because we might get better results than GPT-4.

283
00:20:05,440 --> 00:20:07,400
Yeah, watch this space.

284
00:20:07,400 --> 00:20:13,640
I will be monitoring that OpenAI blog with keen interest.

285
00:20:13,640 --> 00:20:21,160
So if you're a marketer, if you're a sales professional, get a hold of a Claude Opus Pro

286
00:20:21,160 --> 00:20:26,560
account, £18 a month, just to test it and see how well it compares to the outputs that you

287
00:20:26,560 --> 00:20:27,840
get from GPT4.

288
00:20:27,840 --> 00:20:35,760
Personally, as a user of Magi, which I plug frequently on this podcast, the lead developer

289
00:20:35,760 --> 00:20:38,720
there, Dustin, is awesome for making rapid updates to the tool.

290
00:20:38,720 --> 00:20:42,320
So I've got my fingers crossed that I'm going to be able to access Opus in the back

291
00:20:42,320 --> 00:20:44,440
end of Magi by the end of the week.

292
00:20:44,440 --> 00:20:47,640
I might give him a little bit of a nudge in the Facebook group, see if he can make that

293
00:20:47,640 --> 00:20:49,800
happen for us.

294
00:20:49,800 --> 00:20:55,000
Is it worth actually just briefly talking about cost because Opus is quite expensive,

295
00:20:55,000 --> 00:20:56,000
isn't it?

296
00:20:56,000 --> 00:20:57,000
Yeah, it is.

297
00:20:57,000 --> 00:21:04,080
And for developers such as the team behind Magi, this is going to be a big consideration.

298
00:21:04,080 --> 00:21:09,440
So they price it based on input tokens as well as output tokens.

299
00:21:09,440 --> 00:21:21,920
Opus, the biggest model, is $15 per million tokens input and $75 per million tokens output.

300
00:21:21,920 --> 00:21:31,120
By way of comparison to GPT4 Turbo, that's $10 in, $30 out.

301
00:21:31,120 --> 00:21:36,540
So it's two and a half times as expensive in terms of the outputs.

302
00:21:36,540 --> 00:21:44,200
When you compare it to GPT4, not GPT4 Turbo, it is more comparable because GPT4 is $120

303
00:21:44,200 --> 00:21:45,200
per million out.

304
00:21:45,200 --> 00:21:47,800
So GPT-4 is more expensive there.

305
00:21:47,800 --> 00:22:00,320
Haiku at the other end of the scale is 25 cents per million in and $1.25 out per million.

306
00:22:00,320 --> 00:22:09,240
And that compares to GPT3.5 Turbo, which is $1.50 out and 50 cents in.

307
00:22:09,240 --> 00:22:18,100
So GPT-3.5 Turbo is twice the cost on the input tokens and comparable, slightly more

308
00:22:18,100 --> 00:22:19,720
expensive, on the outputs.

309
00:22:19,720 --> 00:22:21,180
Yeah, that's interesting.

310
00:22:21,180 --> 00:22:23,380
We debate off air, don't we?

311
00:22:23,380 --> 00:22:26,820
How much detail should we go into when it comes to API stuff?

312
00:22:26,820 --> 00:22:28,120
How nerdy should we get?

313
00:22:28,120 --> 00:22:33,480
I was listening to a podcast last weekend.

314
00:22:33,480 --> 00:22:34,480
Was it the All In podcast?

315
00:22:34,480 --> 00:22:40,160
I think it was, where they were talking about midsize companies and large enterprises having

316
00:22:40,160 --> 00:22:45,560
to really think long and hard about their enterprise level software subscriptions because

317
00:22:45,560 --> 00:22:52,520
of the ability to build low code and no code products using tools like Bubble and using

318
00:22:52,520 --> 00:23:00,560
tools like GPT-4 and now Claude 3 to help you write the code. You have the ability as a marketer,

319
00:23:00,560 --> 00:23:05,600
if you're willing to have a play, or even to have an outsourced development pipeline, software

320
00:23:05,600 --> 00:23:10,640
development pipeline, or build a team internally, to customize and build tools that work just

321
00:23:10,640 --> 00:23:13,680
for your workflow.

322
00:23:13,680 --> 00:23:19,840
There's lots of discussions about the future of developers, which I don't feel informed

323
00:23:19,840 --> 00:23:21,620
enough to comment on, to be honest.

324
00:23:21,620 --> 00:23:26,480
The one thing that has definitely been emerging and will continue to evolve is the ability

325
00:23:26,480 --> 00:23:31,560
of non-developers to build software tools pretty quickly, pretty cheaply that work really

326
00:23:31,560 --> 00:23:33,060
well.

327
00:23:33,060 --> 00:23:35,320
A lot of that is driven by API usage.

328
00:23:35,320 --> 00:23:42,640
If you're using natural language in or out, you're going to be using the APIs for Claude,

329
00:23:42,640 --> 00:23:46,620
for ChatGPT, and therefore this is useful stuff to know.

330
00:23:46,620 --> 00:23:52,160
If you are a marketer, I think it's definitely worth thinking, what repetitive tasks do I

331
00:23:52,160 --> 00:23:56,200
do that there isn't a tool out there for, where I could fairly easily create one in

332
00:23:56,200 --> 00:23:58,920
Bubble or some other tool?

333
00:23:58,920 --> 00:24:05,520
We spoke about Claude for Sheets, the integration of Google Sheets with Anthropic's Claude,

334
00:24:05,520 --> 00:24:07,600
and that works with these new models.

335
00:24:07,600 --> 00:24:13,520
You can just input the model name into the function that you're calling and it will then

336
00:24:13,520 --> 00:24:16,220
integrate neatly with it.

337
00:24:16,220 --> 00:24:22,880
You can now build a spreadsheet that can automate the content creation for your social media

338
00:24:22,880 --> 00:24:25,840
using these high-end models.

339
00:24:25,840 --> 00:24:33,080
You can do data cleansing of your database so that you've got consistent capitalization

340
00:24:33,080 --> 00:24:38,540
and formatting of names, email addresses, telephone numbers, and you can use the Haiku

341
00:24:38,540 --> 00:24:43,340
model for very low cost to do that for you.
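
As a rough illustration of the kind of cleansing being described, here is a minimal rule-based sketch in Python. It is not Claude for Sheets and it does not call any model; the function names and the sample row are hypothetical. Something like this can serve as a baseline to spot-check what a model returns, and it also shows where fixed rules run out (names such as "McDonald" or "O'Brien"), which is where a language model may earn its keep.

```python
import re


def clean_name(raw: str) -> str:
    """Collapse extra whitespace and capitalise each part of a name."""
    parts = raw.strip().split()
    return " ".join(part.capitalize() for part in parts)


def clean_email(raw: str) -> str:
    """Trim surrounding whitespace and lower-case an email address."""
    return raw.strip().lower()


def clean_phone(raw: str) -> str:
    """Keep digits only, preserving a leading '+' if one is present."""
    raw = raw.strip()
    digits = re.sub(r"\D", "", raw)
    return "+" + digits if raw.startswith("+") else digits


# Hypothetical messy row, as it might sit in a contact database export.
row = {
    "name": "  jANE   doe ",
    "email": " Jane.Doe@Example.COM ",
    "phone": "+44 20 7946 0958",
}

cleaned = {
    "name": clean_name(row["name"]),
    "email": clean_email(row["email"]),
    "phone": clean_phone(row["phone"]),
}
print(cleaned)
```

Comparing a model's output against a baseline like this on a sample of rows is one way to check that the cheap model is actually being consistent before running it across a whole database.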

342
00:24:43,340 --> 00:24:46,160
These are the things that we should be thinking about because you don't need to be a developer

343
00:24:46,160 --> 00:24:47,360
to use those.

344
00:24:47,360 --> 00:24:55,920
You just need to know how to write a function in a Google Sheet or an Excel doc equivalent

345
00:24:55,920 --> 00:24:59,160
and then off you go.

346
00:24:59,160 --> 00:25:00,160
Yeah.

347
00:25:00,160 --> 00:25:05,960
Then it comes down to imagination and knowing what outputs and outcomes you want and figuring

348
00:25:05,960 --> 00:25:08,080
out how to structure the information.

349
00:25:08,080 --> 00:25:12,920
One of the interesting things about Claude 3 is now the introduction of multimodality

350
00:25:12,920 --> 00:25:17,960
in terms of being able to interpret images and obviously handle data a bit like code

351
00:25:17,960 --> 00:25:19,400
interpreters.

352
00:25:19,400 --> 00:25:24,880
The breadth and depth of things that you can push into these models to get them to do stuff

353
00:25:24,880 --> 00:25:26,620
is ever more expanding.

354
00:25:26,620 --> 00:25:34,200
I was showing the team a use case for... There was a table on the internet that had 80% of

355
00:25:34,200 --> 00:25:37,480
the information that I wanted for a slide deck that I was putting together, but it didn't

356
00:25:37,480 --> 00:25:38,480
have everything.

357
00:25:38,480 --> 00:25:45,100
I just took a screen grab with the snipping tool, fed it into ChatGPT and it gave me tab

358
00:25:45,100 --> 00:25:49,160
delimited text that I could copy paste into Excel to get that whole table.

359
00:25:49,160 --> 00:25:54,760
That whole process took me about 20 seconds versus 15 minutes to manually retype that

360
00:25:54,760 --> 00:25:56,600
table out again.
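
If you would rather check the tab-delimited text before pasting it into Excel, a small Python sketch like the one below may help. It assumes the model's reply has already been copied into a string; `parse_table` is a hypothetical helper name, and the sample rows are made up from the Claude 3 model tiers mentioned at the top of the episode.

```python
import csv
import io


def parse_table(text: str) -> list[dict[str, str]]:
    """Parse tab-delimited text (first line = header) into a list of row dicts.

    Raises ValueError if the text is empty or a row has the wrong number of
    cells, which is a cheap way to catch a table the model has mangled.
    """
    rows = list(csv.reader(io.StringIO(text.strip("\n")), delimiter="\t"))
    if not rows:
        raise ValueError("no rows found")
    header, body = rows[0], rows[1:]
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no} has {len(row)} cells, expected {len(header)}"
            )
    return [dict(zip(header, row)) for row in body]


# Hypothetical model output copied from the chat window.
reply = "Model\tTier\nHaiku\tsmallest\nSonnet\tmid-range\nOpus\tlargest\n"
for record in parse_table(reply):
    print(record)
```

The row-length check is the useful part: a screenshot-to-text step can silently drop or merge a cell, and a quick structural check is faster than eyeballing the whole table.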

361
00:25:56,600 --> 00:26:00,680
That's just scratching the surface of what you could do with text analysis of images

362
00:26:00,680 --> 00:26:03,160
and other types of analysis of images.

363
00:26:03,160 --> 00:26:06,160
I think there's so much cool stuff to do.

364
00:26:06,160 --> 00:26:09,080
I think we've given Claude 3 some love.

365
00:26:09,080 --> 00:26:14,160
They are not sponsors of the podcast, although Anthropic, please do get in touch if you want

366
00:26:14,160 --> 00:26:15,920
to chuck some dollars our way.

367
00:26:15,920 --> 00:26:16,920
We will happily take those.

368
00:26:16,920 --> 00:26:19,320
No, I'm only joking.

369
00:26:19,320 --> 00:26:25,040
The other new model this week was Mistral Large, Martin, which is quite interesting.

370
00:26:25,040 --> 00:26:29,560
I don't think we'll spend too much time on this, but for those of you that are not aware,

371
00:26:29,560 --> 00:26:34,360
Mistral is a very well-funded startup and where it's different from the OpenAIs of

372
00:26:34,360 --> 00:26:36,600
the world is it's completely open source.

373
00:26:36,600 --> 00:26:43,520
They're developing super duper language models like GPT-4, but they're open sourcing them

374
00:26:43,520 --> 00:26:46,640
for people to use as they like.

375
00:26:46,640 --> 00:26:50,960
This week they released Mistral Large, which is their latest and most advanced language

376
00:26:50,960 --> 00:26:51,960
model.

377
00:26:51,960 --> 00:26:57,880
They've also released Le Chat, which is their ChatGPT equivalent.

378
00:26:57,880 --> 00:26:58,900
Is it the chat?

379
00:26:58,900 --> 00:26:59,900
Is it the cat?

380
00:26:59,900 --> 00:27:01,560
Who knows?

381
00:27:01,560 --> 00:27:03,040
But that's now also available.

382
00:27:03,040 --> 00:27:08,040
I think one of the interesting things about Mistral Large is just the fact that it's an

383
00:27:08,040 --> 00:27:14,080
open source model, which is near GPT-4 level, Martin.

384
00:27:14,080 --> 00:27:17,280
It's an interesting thing for people to be able to play with now.

385
00:27:17,280 --> 00:27:18,280
Yeah.

386
00:27:18,280 --> 00:27:20,560
32k context window.

387
00:27:20,560 --> 00:27:27,720
It's got native fluency in English, French, Spanish, German and Italian with one of the

388
00:27:27,720 --> 00:27:31,760
best understandings of grammar and cultural context.

389
00:27:31,760 --> 00:27:33,160
So it's very good like that.

390
00:27:33,160 --> 00:27:37,000
And it's available on the Microsoft Azure cloud.

391
00:27:37,000 --> 00:27:42,240
And that's no surprise because they just had a chunk of funding from Microsoft.

392
00:27:42,240 --> 00:27:52,080
So Microsoft are not just all in on OpenAI, they are sharing the love with other LLM foundation

393
00:27:52,080 --> 00:27:53,080
model builders.

394
00:27:53,080 --> 00:27:55,500
So you can access it via API.

395
00:27:55,500 --> 00:27:58,480
You can access it through their infrastructure hosted in Europe.

396
00:27:58,480 --> 00:28:01,200
You can access it via Azure.

397
00:28:01,200 --> 00:28:05,080
And of course, because it's open source, you can self deploy.

398
00:28:05,080 --> 00:28:10,360
So if you want to get the model and deploy it yourself into your own infrastructure, you

399
00:28:10,360 --> 00:28:11,360
can do that.

400
00:28:11,360 --> 00:28:12,360
Yeah.

401
00:28:12,360 --> 00:28:13,680
And you can play with the model weights.

402
00:28:13,680 --> 00:28:16,280
And this is all the benefits of open source, right?

403
00:28:16,280 --> 00:28:20,480
Like they create it, they give it to you and they say, here's everything you need to know.

404
00:28:20,480 --> 00:28:22,360
Here's full control.

405
00:28:22,360 --> 00:28:23,800
Deploy it as you see fit.

406
00:28:23,800 --> 00:28:29,720
And I think there is some belief that in the end, all of this racing between models is

407
00:28:29,720 --> 00:28:33,040
going to become irrelevant because they're basically going to be commoditized down to

408
00:28:33,040 --> 00:28:36,180
the fact that there's an open source high power model.

409
00:28:36,180 --> 00:28:41,200
There are paid-for models that are just easy to deploy, whether that's cloud-based like

410
00:28:41,200 --> 00:28:45,760
in Azure or you're just using a tool like ChatGPT.

411
00:28:45,760 --> 00:28:50,240
But while we're part of that race, it's hard not to get caught up in everybody like nudging

412
00:28:50,240 --> 00:28:52,040
each other forward.

413
00:28:52,040 --> 00:28:53,920
Open source is now at the level of GPT-4.

414
00:28:53,920 --> 00:28:56,720
GPT-4 is the standard, but now Claude 3 has come out.

415
00:28:56,720 --> 00:28:58,260
So now is Claude 3 the standard?

416
00:28:58,260 --> 00:28:59,780
When does GPT-5 come out?

417
00:28:59,780 --> 00:29:02,760
When can I play with Gemini 1.5 Pro?

418
00:29:02,760 --> 00:29:06,720
And I think most of it just comes down to at this point, use cases and figuring out

419
00:29:06,720 --> 00:29:08,320
which of those models is going to be best.

420
00:29:08,320 --> 00:29:14,040
But if you are looking at building your own tools and wanting real control over the large

421
00:29:14,040 --> 00:29:18,520
language model, then obviously open source models like Mistral Large are a really good

422
00:29:18,520 --> 00:29:19,520
option.

423
00:29:19,520 --> 00:29:24,520
If you're a consumer user, using it personally or professionally as an assistant and you're

424
00:29:24,520 --> 00:29:31,520
unsure whether to use Claude AI or ChatGPT or Copilot or any of these kind of chat-based

425
00:29:31,520 --> 00:29:35,720
interfaces, then the Mistral one is an interesting one to have a look at because they seem to

426
00:29:35,720 --> 00:29:38,960
have slightly fewer guardrails on it.

427
00:29:38,960 --> 00:29:43,960
So if you get frustrated with the model saying, oh, sorry, I can't do that because of copyright

428
00:29:43,960 --> 00:29:47,380
or because of ethics or because of this, that and the other.

429
00:29:47,380 --> 00:29:53,280
My experience with Mistral so far has been that it's a bit more, you know, willing.

430
00:29:53,280 --> 00:29:55,000
There you go.

431
00:29:55,000 --> 00:29:59,400
With great power comes great responsibility, people. Use with care.

432
00:29:59,400 --> 00:30:03,400
Let's switch briefly to image generation, Martin.

433
00:30:03,400 --> 00:30:08,200
Tell us about Jasper's acquisition of ClipDrop this week.

434
00:30:08,200 --> 00:30:10,520
ClipDrop was once our tool of the week.

435
00:30:10,520 --> 00:30:17,420
It was built by Stability AI and it was the front end really to a bunch of tools for image

436
00:30:17,420 --> 00:30:18,420
manipulation.

437
00:30:18,420 --> 00:30:20,480
So you could do background removal.

438
00:30:20,480 --> 00:30:22,460
You could do object replacement.

439
00:30:22,460 --> 00:30:26,840
You could do image generation and well, you were a big fan of it, weren't you?

440
00:30:26,840 --> 00:30:28,920
There was a bunch of tools that you liked for it.

441
00:30:28,920 --> 00:30:32,760
I still think its background removal tool is the best on the market.

442
00:30:32,760 --> 00:30:37,000
We've got Canva as well and I think it's better than Canva's background removal, even though

443
00:30:37,000 --> 00:30:43,000
to my understanding they're built on the same Stability AI Stable Diffusion architecture.

444
00:30:43,000 --> 00:30:48,520
But the ClipDrop background remover tool, its ability to strip backgrounds out of incredibly

445
00:30:48,520 --> 00:30:50,760
complex images is impressive.

446
00:30:50,760 --> 00:30:54,160
I routinely use it for that and image upscaling.

447
00:30:54,160 --> 00:30:58,920
It has a whole host of other image generation tools, but for me it's honestly not better

448
00:30:58,920 --> 00:31:05,280
than ChatGPT, which now has DALL-E 3 built in, obviously, and Midjourney is still the

449
00:31:05,280 --> 00:31:07,640
absolute go-to.

450
00:31:07,640 --> 00:31:14,200
Although Stability AI have released Stable Diffusion 3 this week as well.

451
00:31:14,200 --> 00:31:17,560
We won't go into too much detail about it on today's podcast, because we haven't had

452
00:31:17,560 --> 00:31:22,520
a chance to really play with it, but it promises to offer higher quality images.

453
00:31:22,520 --> 00:31:24,140
It's much better at text.

454
00:31:24,140 --> 00:31:27,520
So again, it's one of those things where we can give advice to you on what we think is

455
00:31:27,520 --> 00:31:30,520
the best image generation tool today, but you have to keep playing with these things

456
00:31:30,520 --> 00:31:35,680
because everyone keeps releasing new models that might be improvements.

457
00:31:35,680 --> 00:31:39,640
ClipDrop will have Stable Diffusion 3 built in at some point soon, one would have thought.

458
00:31:39,640 --> 00:31:43,000
So maybe its image generation capabilities will improve.

459
00:31:43,000 --> 00:31:47,720
But yeah, I'm mostly using it for some of that image tweaking rather than image generation.

460
00:31:47,720 --> 00:31:52,120
Well your use of it will soon come under the banner of Jasper.

461
00:31:52,120 --> 00:31:57,800
Jasper, which has long been one of the market leaders in text generation.

462
00:31:57,800 --> 00:32:02,920
They were one of the early movers to create if you wanted marketing copy for your website,

463
00:32:02,920 --> 00:32:06,640
product descriptions, meta titles, marketing emails, whatever.

464
00:32:06,640 --> 00:32:08,360
They had all of the templates for that.

465
00:32:08,360 --> 00:32:14,920
And over the years they've become more brand voice centric and you can add in your own

466
00:32:14,920 --> 00:32:19,920
knowledge base so that it understands your products and your services with more detail

467
00:32:19,920 --> 00:32:20,920
and more accuracy.

468
00:32:20,920 --> 00:32:27,800
Well they've acquired ClipDrop from Stability and they're now integrating it into the whole

469
00:32:27,800 --> 00:32:29,620
platform.

470
00:32:29,620 --> 00:32:37,460
This gives Jasper a European presence because they have an office in Paris, which is now one

471
00:32:37,460 --> 00:32:40,320
of the main European hubs for AI development.

472
00:32:40,320 --> 00:32:48,960
In terms of the product itself, it's understood that ClipDrop will stay as a standalone subscription

473
00:32:48,960 --> 00:32:54,620
available to customers, but it will become available as well through the Jasper API.

474
00:32:54,620 --> 00:32:58,120
So all of those tools you can now build into your workflows.

475
00:32:58,120 --> 00:33:04,320
And this is particularly interesting for the enterprise customers that are using Jasper

476
00:33:04,320 --> 00:33:10,280
because they're basically saying, look, we are going to become your go-to place for content

477
00:33:10,280 --> 00:33:16,280
creation and all things AI assisted content development.

478
00:33:16,280 --> 00:33:23,360
Yeah, I really wanted to get on with Jasper and Rytr and some of these other tools, but

479
00:33:23,360 --> 00:33:27,120
I just didn't find them any more useful than ChatGPT.

480
00:33:27,120 --> 00:33:29,960
And obviously they come with additional expenses.

481
00:33:29,960 --> 00:33:32,520
You scale them up across your users.

482
00:33:32,520 --> 00:33:37,200
I do think it's an interesting move by Jasper to try and make it possible and easy for you

483
00:33:37,200 --> 00:33:40,280
to create images alongside your content.

484
00:33:40,280 --> 00:33:47,020
And one assumes they'll roll out their "we understand your brand" play to image generation

485
00:33:47,020 --> 00:33:49,080
alongside copy generation.

486
00:33:49,080 --> 00:33:52,820
So if you're writing a blog post or a social media post, one assumes they'll be looking

487
00:33:52,820 --> 00:33:57,360
to automatically create images to go alongside those posts that are on brand and all this

488
00:33:57,360 --> 00:33:58,360
other stuff.

489
00:33:58,360 --> 00:34:02,240
So I think that'll be interesting to see how that emerges.

490
00:34:02,240 --> 00:34:09,620
It was interesting to see a writing company, in essence, buy an image generation company.

491
00:34:09,620 --> 00:34:12,360
One assumes that's a done for you, right?

492
00:34:12,360 --> 00:34:14,680
Like they don't have to worry about how they're going to integrate it.

493
00:34:14,680 --> 00:34:19,080
They don't have to worry about how to access and leverage the models and et cetera.

494
00:34:19,080 --> 00:34:25,160
But yeah, my biggest fear was, do I have to be a Jasper customer to access ClipDrop?

495
00:34:25,160 --> 00:34:27,280
And the answer at least for now is no.

496
00:34:27,280 --> 00:34:31,720
So I can still do my little image tweaking that I can do today in ClipDrop.

497
00:34:31,720 --> 00:34:35,880
Although dear listeners, Magi also has some image editing capabilities built into the

498
00:34:35,880 --> 00:34:37,920
back of that because it is one of the coolest tools.

499
00:34:37,920 --> 00:34:40,720
So maybe I'll just have to lean even further into Magi.

500
00:34:40,720 --> 00:34:41,720
But yes.

501
00:34:41,720 --> 00:34:50,160
Right, next story is about a really interesting new study that came out this week entitled

502
00:34:50,160 --> 00:34:54,900
WebArena: A Realistic Web Environment for Building Autonomous Agents.

503
00:34:54,900 --> 00:34:59,880
And the reason this is interesting is as Martin alluded to earlier in the podcast, agents

504
00:34:59,880 --> 00:35:04,200
are the next frontier of how we're going to interact with these models because they're

505
00:35:04,200 --> 00:35:09,080
going to be able to carry out more complex tasks and get jobs done for us.

506
00:35:09,080 --> 00:35:12,660
So at the moment, most of you who listen to this podcast, I would have thought, have played

507
00:35:12,660 --> 00:35:16,720
with large language models and played with ChatGPT.

508
00:35:16,720 --> 00:35:18,560
And it's very iterative, right?

509
00:35:18,560 --> 00:35:22,520
You have to, you ask a question, you're very much involved in the conversation, shaping

510
00:35:22,520 --> 00:35:24,460
the output that you get.

511
00:35:24,460 --> 00:35:30,560
And as much as you can access external databases through things like custom GPTs,

512
00:35:30,560 --> 00:35:35,960
at least in my experience, they're a bit ropey, they don't work that great.

513
00:35:35,960 --> 00:35:41,680
So what we're still waiting for is our ability to give a large language model a task that

514
00:35:41,680 --> 00:35:45,440
has multiple steps in and then it comes back and it delivers an output.

515
00:35:45,440 --> 00:35:52,500
So what the researchers did here was they created an environment for modeling the complexities

516
00:35:52,500 --> 00:35:56,520
of real-world scenarios so you can test AI agents on them.

517
00:35:56,520 --> 00:36:05,640
And in essence, it allows the agents to do things like clicking on web pages, browsing,

518
00:36:05,640 --> 00:36:10,560
accessing information, interacting with e-commerce sites, social forum discussions, just to try

519
00:36:10,560 --> 00:36:17,960
and create a rigorous environment for testing these agents so you can compare their capabilities.

520
00:36:17,960 --> 00:36:22,880
They've also built in the ability to go and access additional information like the equivalent

521
00:36:22,880 --> 00:36:27,400
of checking a manual, like a human would do if it was trying to solve a task.

522
00:36:27,400 --> 00:36:32,880
So A, I think that's interesting because I think it's an important step towards understanding

523
00:36:32,880 --> 00:36:36,440
how useful agents are and then tracking their performance as they improve.

524
00:36:36,440 --> 00:36:41,160
We talked about how useful benchmarks have been for assessing the outputs of current

525
00:36:41,160 --> 00:36:42,160
large language models.

526
00:36:42,160 --> 00:36:47,080
I think the most interesting thing for me in this entire study was that the best performing

527
00:36:47,080 --> 00:36:56,480
agent was GPT-4, which had a 14.41% success rate on these multi-step tasks compared to

528
00:36:56,480 --> 00:36:59,880
the human benchmark, which was 78%.
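
To put those two figures side by side, here is a small Python sketch. The 14.41% and 78% numbers are the ones quoted above; `success_rate` is a hypothetical helper showing how a rate like this could be computed from per-task pass/fail outcomes, and it is not the benchmark's own scoring code.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Percentage of tasks completed successfully."""
    if not outcomes:
        raise ValueError("no task outcomes supplied")
    return 100 * sum(outcomes) / len(outcomes)


# Figures as quoted in the episode.
gpt4_rate = 14.41   # best-performing agent, percent
human_rate = 78.0   # human benchmark, percent

gap = human_rate - gpt4_rate    # difference in percentage points
ratio = human_rate / gpt4_rate  # how many times more often humans succeed

print(f"gap: {gap:.2f} percentage points")
print(f"humans succeed roughly {ratio:.1f}x as often")
```

On those numbers the gap is a little under 64 percentage points, with humans completing tasks roughly five times as often as the best agent.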

529
00:36:59,880 --> 00:37:06,000
The long and short of it is GPT-4 does not act as a good agent within this environment.

530
00:37:06,000 --> 00:37:07,360
Maybe that's environment specific.

531
00:37:07,360 --> 00:37:09,120
I'm tempted to say it's not.

532
00:37:09,120 --> 00:37:14,840
And I wonder, Martin, if that's why for all of the hype towards the back end of 2023,

533
00:37:14,840 --> 00:37:20,760
we're still in March and we haven't really seen the emergence of an agent that we all

534
00:37:20,760 --> 00:37:25,280
flock to that can start to get more complex tasks done for us.

535
00:37:25,280 --> 00:37:30,600
And that's because large language models as they are today need significant upgrades or

536
00:37:30,600 --> 00:37:33,800
architecture changes to make them effective agents.

537
00:37:33,800 --> 00:37:37,560
Yeah, that's not where the focus has been up to this point, has it?

538
00:37:37,560 --> 00:37:41,960
But I think we can clearly see that is very much on the horizon.

539
00:37:41,960 --> 00:37:46,320
We know that OpenAI have spoken publicly about this.

540
00:37:46,320 --> 00:37:51,840
Anthropic have clearly got a big bent towards agentic capabilities, as I've spoken about

541
00:37:51,840 --> 00:37:53,320
with Claude 3.

542
00:37:53,320 --> 00:38:00,840
Yeah, this is the future and having this new benchmark is great.

543
00:38:00,840 --> 00:38:08,200
And I'm excited to see that move from 14 to 60% within one generation.

544
00:38:08,200 --> 00:38:13,040
I'm sure it's going to absolutely improve very, very rapidly.

545
00:38:13,040 --> 00:38:14,040
I agree.

546
00:38:14,040 --> 00:38:20,680
I think if I had to speculate wildly, coming back to laws of diminishing returns for large

547
00:38:20,680 --> 00:38:26,880
language models on current benchmarks, and we joked about the imminent arrival of GPT-5

548
00:38:26,880 --> 00:38:34,080
or GPT-4.5, and there's been rumours swirling about the custom GPTs as like a prelude to

549
00:38:34,080 --> 00:38:36,120
agents.

550
00:38:36,120 --> 00:38:41,760
There's a bit of me that wonders, does OpenAI release a slightly better model to compete

551
00:38:41,760 --> 00:38:42,920
with Claude 3?

552
00:38:42,920 --> 00:38:45,320
Oh, you got 86% on that benchmark.

553
00:38:45,320 --> 00:38:46,320
Well, guess what?

554
00:38:46,320 --> 00:38:47,320
We got 87.5.

555
00:38:47,320 --> 00:38:48,320
Hurrah.

556
00:38:48,320 --> 00:38:49,320
We're clearly the best.

557
00:38:49,320 --> 00:38:51,320
It's like, you're kind of the same at this point.

558
00:38:51,320 --> 00:38:57,880
Or do they hold out for something more, what was the word, agentic that you used earlier?

559
00:38:57,880 --> 00:39:04,000
Something more agentic that is a bit more of a leap forward in terms of true functional

560
00:39:04,000 --> 00:39:09,000
capability, especially in the world of work for getting stuff done.

561
00:39:09,000 --> 00:39:14,520
And my speculation based on very little evidence other than a gut feel would be, that's what

562
00:39:14,520 --> 00:39:16,240
we'll see next from OpenAI.

563
00:39:16,240 --> 00:39:22,160
It will be an agent of sorts, proto-agent, rather than a slightly better model.

564
00:39:22,160 --> 00:39:29,820
Yeah, the slightly better model you can imagine being the 4.5 update, which people are speculating

565
00:39:29,820 --> 00:39:31,880
might not happen and they might go straight to a five.

566
00:39:31,880 --> 00:39:38,480
And typically when there's a full number upgrade, we see a whole new capability, don't we?

567
00:39:38,480 --> 00:39:39,480
Right.

568
00:39:39,480 --> 00:39:44,280
While we're talking about agents and tool of the week and people that we've spoken to

569
00:39:44,280 --> 00:39:51,480
in the past, GoCharlie have updated their agent capabilities for marketers.

570
00:39:51,480 --> 00:39:52,960
So it might be worth checking that out.

571
00:39:52,960 --> 00:39:58,120
If you've got a subscription, you can now set it to do scheduled tasks.

572
00:39:58,120 --> 00:40:04,800
So you could say at 9am every morning, go online, research the latest news across my

573
00:40:04,800 --> 00:40:10,860
industry and write me a series of LinkedIn posts talking about those.

574
00:40:10,860 --> 00:40:15,680
And every day you go into your office and you've got your LinkedIn posts ready to share

575
00:40:15,680 --> 00:40:16,680
right there for you.

576
00:40:16,680 --> 00:40:17,680
Well, that's pretty cool.

577
00:40:17,680 --> 00:40:22,320
GoCharlie is another one we love on the podcast.

578
00:40:22,320 --> 00:40:24,200
So that was an old tool of the week.

579
00:40:24,200 --> 00:40:26,360
We've got a tool of the week later as well.

580
00:40:26,360 --> 00:40:30,720
Word of the week is agentic and while we're talking about being agentic, I've always dreamed

581
00:40:30,720 --> 00:40:33,560
of being agentic myself.

582
00:40:33,560 --> 00:40:38,760
Let's talk about Copilot for Finance because whilst not quite an agent, it is stringing together

583
00:40:38,760 --> 00:40:41,400
some more interesting capabilities.

584
00:40:41,400 --> 00:40:42,400
It is.

585
00:40:42,400 --> 00:40:47,440
And it was the thing that was missing from the Microsoft Office suite.

586
00:40:47,440 --> 00:40:54,640
We didn't see any Copilot in Excel and this was a frustration.

587
00:40:54,640 --> 00:40:56,880
I'd recently upgraded to Copilot.

588
00:40:56,880 --> 00:41:03,480
If you watch the Copilot product video for Microsoft Office 365, there are clips of it

589
00:41:03,480 --> 00:41:07,760
doing cool things in Excel that I've wanted to get my hands on.

590
00:41:07,760 --> 00:41:11,800
But dear listener, they have not released that functionality just yet.

591
00:41:11,800 --> 00:41:17,160
So we hang fire and instead make do with Copilot for Finance.

592
00:41:17,160 --> 00:41:18,520
Now this is an add-on.

593
00:41:18,520 --> 00:41:21,440
You have to download it from the Microsoft store.

594
00:41:21,440 --> 00:41:24,800
It doesn't come baked into Excel.

595
00:41:24,800 --> 00:41:31,840
It's in public preview at the moment and it really has two main functions which are to

596
00:41:31,840 --> 00:41:38,360
reconcile data, so comparing your financial records and it will do this across files within

597
00:41:38,360 --> 00:41:41,800
your SharePoint or within your OneDrive.

598
00:41:41,800 --> 00:41:46,680
Or you've got analyze variances to analyze trends and data variances.

599
00:41:46,680 --> 00:41:54,640
Although I must tell you that that is also coming soon as opposed to available today.

600
00:41:54,640 --> 00:42:02,160
So we're not quite there yet when it comes to Copilot for Excel and it's really frustrating.

601
00:42:02,160 --> 00:42:03,640
I must be honest with you.

602
00:42:03,640 --> 00:42:06,680
Yeah, but it's a step in the right direction.

603
00:42:06,680 --> 00:42:14,160
The fact that it integrates with other platforms like Dynamics 365, SAP, it's obviously going

604
00:42:14,160 --> 00:42:18,440
to be connected to Outlook and Teams and Word.

605
00:42:18,440 --> 00:42:23,520
We're starting to see the emergence of some of these capabilities in Excel, but they've

606
00:42:23,520 --> 00:42:31,880
just chosen this one sort of segment of business to focus on first which makes sense.

607
00:42:31,880 --> 00:42:36,560
For finance managers, reconciling what you've got in Xero versus what you've got in your

608
00:42:36,560 --> 00:42:41,720
bank account versus what you've got in other financial documents is still a surprisingly

609
00:42:41,720 --> 00:42:47,880
manual task to me, but it makes sense for pre-AI because it's nuanced and it's very

610
00:42:47,880 --> 00:42:52,480
easy to make a mistake and I do think that this will hopefully make those people's lives

611
00:42:52,480 --> 00:42:53,800
a bit easier.
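
For anyone curious what "reconciling" boils down to at its simplest, here is a minimal Python sketch. It is not how Copilot for Finance works internally (the episode does not say); it just compares two sets of records keyed by a reference. The `reconcile` function, the invoice references and the amounts are all hypothetical.

```python
from decimal import Decimal


def reconcile(ledger: dict[str, Decimal], bank: dict[str, Decimal]) -> dict:
    """Compare two {reference: amount} mappings and report the differences."""
    ledger_refs, bank_refs = set(ledger), set(bank)
    return {
        # In the accounting system but never seen at the bank.
        "missing_from_bank": sorted(ledger_refs - bank_refs),
        # Seen at the bank but not recorded in the accounting system.
        "missing_from_ledger": sorted(bank_refs - ledger_refs),
        # Present in both, but the amounts disagree: (ledger, bank).
        "amount_mismatch": {
            ref: (ledger[ref], bank[ref])
            for ref in sorted(ledger_refs & bank_refs)
            if ledger[ref] != bank[ref]
        },
    }


# Hypothetical records from an accounting package and a bank export.
ledger = {
    "INV-001": Decimal("120.00"),
    "INV-002": Decimal("75.50"),
    "INV-003": Decimal("40.00"),
}
bank = {
    "INV-001": Decimal("120.00"),
    "INV-002": Decimal("57.50"),  # transposed digits
    "INV-004": Decimal("19.99"),
}

report = reconcile(ledger, bank)
print(report)
```

Real reconciliation is messier than this, because references rarely match exactly and payments get split or batched, which is the nuance that makes it a plausible job for a model rather than a fixed rule.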

612
00:42:53,800 --> 00:43:03,080
The other thing that's kind of interesting here is there's an allusion in the launch

613
00:43:03,080 --> 00:43:08,800
blog post about this that they're exploring copilots for specific use cases, like marketing

614
00:43:08,800 --> 00:43:10,000
could be one.

615
00:43:10,000 --> 00:43:14,080
I think it's interesting to see them look at things like Copilot for Finance and where

616
00:43:14,080 --> 00:43:17,320
we get a copilot for marketing.

617
00:43:17,320 --> 00:43:22,320
I think it demonstrates the overall tension that we've touched on several times in this

618
00:43:22,320 --> 00:43:29,280
episode alone between a master tool, Claude 3 gets 60% in the cross-domain really, really,

619
00:43:29,280 --> 00:43:36,360
really hard exam, and specific tools for specific use cases and which will win out, which are

620
00:43:36,360 --> 00:43:38,320
going to have the greatest utility.

621
00:43:38,320 --> 00:43:44,400
I guess a lot of that as you've beat this drum heavily during the course of the podcast

622
00:43:44,400 --> 00:43:46,880
life is down to UX.

623
00:43:46,880 --> 00:43:51,600
Being able to go into Copilot for Finance and see a button that says reconcile these

624
00:43:51,600 --> 00:43:57,100
figures is much easier than writing a prompt that says these are some figures, I want you

625
00:43:57,100 --> 00:44:01,320
to reconcile them, this is what I want you to do, this is how I want you to do it.

626
00:44:01,320 --> 00:44:07,440
I think maybe the play here is a UX layer actually and probably under the hood is the

627
00:44:07,440 --> 00:44:13,000
same regardless of whether you're using Copilot for marketing or Copilot for Finance, there's

628
00:44:13,000 --> 00:44:19,280
just things built into the UI that make it easier for you to get common tasks done.

629
00:44:19,280 --> 00:44:24,120
That would make absolute sense if nothing else from a compute perspective, never mind

630
00:44:24,120 --> 00:44:32,040
the end user's interface, having fine-tuned specialist tasks that are running the GPT

631
00:44:32,040 --> 00:44:39,960
3.5 turbos of the world rather than the full GPT-4 model to do some basic inference.

632
00:44:39,960 --> 00:44:44,360
That makes absolute sense so you can see why the companies would want to layer that on.

633
00:44:44,360 --> 00:44:47,520
I'm just worried about compute costs, potentially naively.

634
00:44:47,520 --> 00:44:51,480
It's a story we're going to cover later but maybe now's a good time.

635
00:44:51,480 --> 00:44:57,400
We were going to talk about Groq with a Q, not to be confused with Elon Musk's AI chatbot

636
00:44:57,400 --> 00:45:03,400
Grok with a K, which has released a new type of processor that they're calling an

637
00:45:03,400 --> 00:45:12,560
LPU, which is a language processing unit, which is a computer chip designed specifically for

638
00:45:12,560 --> 00:45:15,360
AI models to run on.

639
00:45:15,360 --> 00:45:20,640
Even more specifically than that, not for training AI models but for running them when

640
00:45:20,640 --> 00:45:24,360
you ask a query, which is known as inference.

641
00:45:24,360 --> 00:45:30,720
It's so much faster than running it on GPUs, which have heavily inflated Nvidia's stock

642
00:45:30,720 --> 00:45:33,360
price at least for now.

643
00:45:33,360 --> 00:45:38,960
Anywhere from one-tenth to one-fiftieth of the time required to get an output using a

644
00:45:38,960 --> 00:45:45,400
model like GPT 3.5 if you run it on an LPU versus a GPU.

645
00:45:45,400 --> 00:45:50,000
One assumes that that's also going to come with energy savings.

646
00:45:50,000 --> 00:45:56,680
I think with these miniature models that are increasing in capability like Haiku, new chip

647
00:45:56,680 --> 00:46:01,960
architectures that can run things faster and hopefully cheaper, it'll be interesting to

648
00:46:01,960 --> 00:46:02,960
see.

649
00:46:02,960 --> 00:46:05,360
I mean, they still take energy.

650
00:46:05,360 --> 00:46:09,880
If we can save energy, if we can save money, obviously there's going to be a driving force

651
00:46:09,880 --> 00:46:17,880
but there does also seem to be at the same time this drive towards better chip architectures,

652
00:46:17,880 --> 00:46:24,840
smaller models, everything drifting towards trying to reduce the amount of compute required

653
00:46:24,840 --> 00:46:27,280
for inferencing these models.

654
00:46:27,280 --> 00:46:31,280
So, yeah, it is definitely interesting.

655
00:46:31,280 --> 00:46:38,040
And coming back to Copilot for Finance, if you have multiple hats in your organisation

656
00:46:38,040 --> 00:46:41,000
and finance happens to be one of them, it might be worth having a play with.

657
00:46:41,000 --> 00:46:44,120
I think we're interested to have a little bit of a deeper play, aren't we, Martin, just

658
00:46:44,120 --> 00:46:49,860
to see if it gives us some idea of what the capabilities will be like for data analysis

659
00:46:49,860 --> 00:46:51,960
in Excel more widely.

660
00:46:51,960 --> 00:46:59,240
Yeah, my frustration was born out of seeing the potential for LLM integration into spreadsheets

661
00:46:59,240 --> 00:47:03,200
but never quite tasting the reality because we've seen the preview videos.

662
00:47:03,200 --> 00:47:08,920
And if you've ever used the Gemini for Google Sheets version, you'll know that what they're

663
00:47:08,920 --> 00:47:13,560
delivering right now is just not even worth opening.

664
00:47:13,560 --> 00:47:14,560
Yes.

665
00:47:14,560 --> 00:47:19,160
So, let's move on to another story here.

666
00:47:19,160 --> 00:47:21,360
So, this is a slight change in gear.

667
00:47:21,360 --> 00:47:23,920
We're talking AI generated music.

668
00:47:23,920 --> 00:47:30,240
So, Adobe launched this week an interesting video showcasing their Project Music GenAI

669
00:47:30,240 --> 00:47:36,360
Control, another catchy name there that's going to be really easy for us all to remember,

670
00:47:36,360 --> 00:47:41,880
where they're creating an AI-powered way for you to craft your own music tracks just by typing

671
00:47:41,880 --> 00:47:43,720
out what you want in natural language.

672
00:47:43,720 --> 00:47:47,240
So, it's like asking ChatGPT to make music for you.

673
00:47:47,240 --> 00:47:51,680
You enter in a text description, you get an initial audio clip in whatever genre you're

674
00:47:51,680 --> 00:47:59,680
aiming for like house music or film score using pianos and violins or whatever.

675
00:47:59,680 --> 00:48:03,840
But you can also change the music through natural language prompting so you can extend

676
00:48:03,840 --> 00:48:09,960
it, you can change the tempo, the intensity and a bunch of other stuff.

677
00:48:09,960 --> 00:48:14,240
I think it's really interesting in that if you make podcasts, you're making videos for

678
00:48:14,240 --> 00:48:19,400
your brand, your ability to create unique music rather than having to spend a lot of

679
00:48:19,400 --> 00:48:24,000
money to have that done by a professional music producer or using stock off of like

680
00:48:24,000 --> 00:48:29,440
AudioJungle or one of these other online marketplaces, that music is not expensive,

681
00:48:29,440 --> 00:48:32,560
but the best clips get used multiple times.

682
00:48:32,560 --> 00:48:33,800
You can see, oh, I love that clip.

683
00:48:33,800 --> 00:48:38,200
Oh, it's already had 8,000 downloads and is now in other people's podcasts and videos

684
00:48:38,200 --> 00:48:39,200
and what have you.

685
00:48:39,200 --> 00:48:42,720
So, this is kind of cool for you to be able to create your own music just for yourself.

686
00:48:42,720 --> 00:48:47,080
I was having a look at the video, Martin, and the thing that I was struck by, one of

687
00:48:47,080 --> 00:48:53,640
my nerdy pursuits in the past has been music production using digital audio workstations

688
00:48:53,640 --> 00:48:56,120
like Ableton Live.

689
00:48:56,120 --> 00:49:00,840
One of the great things when you're producing music as a professional, which I'm not, but

690
00:49:00,840 --> 00:49:04,000
I probably know enough to be dangerous, is fine control.

691
00:49:04,000 --> 00:49:08,760
So when you're using those types of workstations, you've usually got all the parts of your audio

692
00:49:08,760 --> 00:49:10,400
in separate tracks.

693
00:49:10,400 --> 00:49:15,000
And if you're making electronic music like I used to, you might have 30, 40, 50 separate

694
00:49:15,000 --> 00:49:20,480
tracks and you'll have fine control over an individual drum sound, an individual snare,

695
00:49:20,480 --> 00:49:26,120
your lead, your vocal, your pad sounds, like you will have super duper control.

696
00:49:26,120 --> 00:49:31,120
And the thing that struck me from the demo videos is a lack of control.

697
00:49:31,120 --> 00:49:36,840
So I think this is going to be a really interesting emergence of tools for those that don't have

698
00:49:36,840 --> 00:49:40,880
those capabilities and that training to be able to produce interesting music.

699
00:49:40,880 --> 00:49:46,240
I think the ability to really control the outputs is going to be extremely limited.

700
00:49:46,240 --> 00:49:51,720
So I expect to see this used on quick social video snippets and stuff like that, but it's

701
00:49:51,720 --> 00:49:55,720
not from my perspective coming for professional music makers anytime soon.

702
00:49:55,720 --> 00:50:02,960
Well, it reminds me of after OpenAI announced the Sora and we saw those amazing text to

703
00:50:02,960 --> 00:50:14,120
video productions and I saw on Threads a professional animator reviewed and critiqued one of those

704
00:50:14,120 --> 00:50:20,080
videos and was circling elements on the video saying this character's eyes should be bigger

705
00:50:20,080 --> 00:50:21,080
and bolder.

706
00:50:21,080 --> 00:50:22,280
They should be smiling more.

707
00:50:22,280 --> 00:50:25,440
The flame flicker isn't real.

708
00:50:25,440 --> 00:50:27,040
The shadow is not quite right.

709
00:50:27,040 --> 00:50:30,080
The feet are gliding, not moving.

710
00:50:30,080 --> 00:50:34,640
And it was just picking out these little things that the moment you actually go beyond the

711
00:50:34,640 --> 00:50:36,920
surface level, oh wow, that's amazing.

712
00:50:36,920 --> 00:50:44,240
And into how does this compare against professional Pixar or equivalent level animation, you start

713
00:50:44,240 --> 00:50:46,280
to see it kind of falls apart a bit.

714
00:50:46,280 --> 00:50:47,280
And that's the thing.

715
00:50:47,280 --> 00:50:52,480
It's the lack of control because the way these models are, you can't then grab the element

716
00:50:52,480 --> 00:50:57,200
in a 3D model and make the eyes bigger and give the flame some flicker.

717
00:50:57,200 --> 00:50:59,320
You get what you're given at this moment in time.

718
00:50:59,320 --> 00:51:06,360
So yeah, we're some way off these being true professional production quality products.

719
00:51:06,360 --> 00:51:07,800
Yeah, I agree.

720
00:51:07,800 --> 00:51:11,280
I think that's going to be the major limitation until it's all in layers and you can modify

721
00:51:11,280 --> 00:51:14,280
it, whether that's video, animation, sound.

722
00:51:14,280 --> 00:51:20,080
I think they're going to be helpful for creating a quick thing for social where you're not

723
00:51:20,080 --> 00:51:24,080
too bothered if you don't have fine control of the output, but anything where you need

724
00:51:24,080 --> 00:51:29,160
to be able to tweak it meaningfully or ensure consistency between videos, I think is a major

725
00:51:29,160 --> 00:51:30,160
limitation.

726
00:51:30,160 --> 00:51:36,440
But hey, it's still interesting to see the developments in this area.

727
00:51:36,440 --> 00:51:40,680
Next we're going to be talking about Perplexity, one of our favourite tools, the Google search

728
00:51:40,680 --> 00:51:41,680
killer.

729
00:51:41,680 --> 00:51:44,080
They did something interesting this week, didn't they, Martin?

730
00:51:44,080 --> 00:51:45,080
They did.

731
00:51:45,080 --> 00:51:49,440
And I think this is really interesting for content marketers that want to think about

732
00:51:49,440 --> 00:51:58,280
creating content using AI that's beyond just using ChatGPT to create some social posts

733
00:51:58,280 --> 00:51:59,760
or what have you.

734
00:51:59,760 --> 00:52:06,720
They've created their own podcast called the Discover Daily and it's three to four minutes

735
00:52:06,720 --> 00:52:13,400
of news, but the news is entirely generated by the Perplexity search engine going off

736
00:52:13,400 --> 00:52:16,600
and summarising what's going on in the world.

737
00:52:16,600 --> 00:52:21,400
And then it's read aloud by an ElevenLabs speaker.

738
00:52:21,400 --> 00:52:29,400
So it's AI generated stories or summaries of real world stories read aloud by AI generated

739
00:52:29,400 --> 00:52:31,240
voices.

740
00:52:31,240 --> 00:52:34,920
You can subscribe to it on the website.

741
00:52:34,920 --> 00:52:35,920
I've had a listen to it.

742
00:52:35,920 --> 00:52:38,720
I think it's really good, really good quality.

743
00:52:38,720 --> 00:52:40,960
The voices sound very natural.

744
00:52:40,960 --> 00:52:46,120
When I've been using ElevenLabs, occasionally I do come across some glitchiness.

745
00:52:46,120 --> 00:52:51,640
There are some weird artefacts that it throws in or pronunciations that just come off a

746
00:52:51,640 --> 00:52:56,080
bit odd, but so far in the podcast, I haven't had that.

747
00:52:56,080 --> 00:53:00,680
So if you're a content marketer and you're thinking maybe I'd like to create an audio

748
00:53:00,680 --> 00:53:08,840
feed of our blogs, you can now take inspiration from this and use the ElevenLabs integration.

749
00:53:08,840 --> 00:53:14,440
You can actually build an integration using Zapier and get your own RSS feed and podcasts

750
00:53:14,440 --> 00:53:15,440
live.

751
00:53:15,440 --> 00:53:19,480
So that every time you write a blog post, you automatically create a spoken version

752
00:53:19,480 --> 00:53:22,120
that's pushed out via RSS to your podcast.

753
00:53:22,120 --> 00:53:23,840
Yeah, that is pretty cool.

754
00:53:23,840 --> 00:53:27,080
Alrighty, so last couple of stories then.

755
00:53:27,080 --> 00:53:29,340
Let's talk Klarna.

756
00:53:29,340 --> 00:53:34,000
So there was an interesting story this week from the fintech company Klarna that posted

757
00:53:34,000 --> 00:53:39,000
online about their new OpenAI-powered customer service chatbot.

758
00:53:39,000 --> 00:53:45,240
So they've created a new chatbot driven by GPT-4 that's now had over 2.3 million

759
00:53:45,240 --> 00:53:48,160
conversations in just the last month.

760
00:53:48,160 --> 00:53:54,360
And what they are expecting is that it will deliver a $40 million increase in profits,

761
00:53:54,360 --> 00:54:01,640
one assumes due to savings from staff by moving to this AI wizard rather than having so many

762
00:54:01,640 --> 00:54:03,640
humans doing the work.

763
00:54:03,640 --> 00:54:08,960
In essence, the chatbot was replacing 700 full-time workers.

764
00:54:08,960 --> 00:54:11,880
I'm not sure actually if those people are no longer with the business, but that's how

765
00:54:11,880 --> 00:54:19,440
much of a workload that the bot was able to get through.

766
00:54:19,440 --> 00:54:24,880
It's also reduced the average customer issue resolution time down from 11 minutes to two

767
00:54:24,880 --> 00:54:25,880
minutes.

768
00:54:25,880 --> 00:54:31,960
And of course, it's working 24/7 and is fluent in more than 35 languages.

769
00:54:31,960 --> 00:54:39,760
So, assuming that those 700 employees are no longer with the business, Klarna is obviously

770
00:54:39,760 --> 00:54:42,840
reshaping how it delivers customer service.

771
00:54:42,840 --> 00:54:48,840
But according to the data they've released, at least, it's improving the service that customers

772
00:54:48,840 --> 00:54:51,240
get in terms of speed and quality of output.

773
00:54:51,240 --> 00:54:56,160
The metrics they had around customer satisfaction suggested that customer satisfaction

774
00:54:56,160 --> 00:55:03,480
remained stable, but it's obviously saved them some serious cash when it comes to actually delivering

775
00:55:03,480 --> 00:55:04,640
customer service.

776
00:55:04,640 --> 00:55:09,840
So Martin, we've talked about customer service a number of times on the podcast, from a text

777
00:55:09,840 --> 00:55:16,880
based customer service through to the replacement of even humans on the phone using tools like

778
00:55:16,880 --> 00:55:19,120
ElevenLabs synthetic voices.

779
00:55:19,120 --> 00:55:23,360
And this is one of the first examples I've seen where a company's really gone to town,

780
00:55:23,360 --> 00:55:28,720
cut a load of staff, saved a load of money, but also still delivered a high quality of

781
00:55:28,720 --> 00:55:31,080
service to their customers.

782
00:55:31,080 --> 00:55:35,360
We have also previously talked about a few examples where those AI-driven chatbots have

783
00:55:35,360 --> 00:55:39,280
gone seriously wrong.

784
00:55:39,280 --> 00:55:43,880
A number of different examples related to car dealerships.

785
00:55:43,880 --> 00:55:49,960
Air Canada had a bit of a nightmare this week as well in terms of the bot telling a customer

786
00:55:49,960 --> 00:55:55,160
that they were going to get a refund on a flight, which the customer then didn't get because

787
00:55:55,160 --> 00:55:58,200
it turned out to not be true because the bot was basically lying.

788
00:55:58,200 --> 00:56:00,560
But yes, interesting.

789
00:56:00,560 --> 00:56:02,360
What are your thoughts on this story, Martin?

790
00:56:02,360 --> 00:56:09,600
Yeah, so the CEO of Klarna had said that they outsourced their customer service.

791
00:56:09,600 --> 00:56:13,840
So I think this meant that they just didn't need to use 700 agents that they would have

792
00:56:13,840 --> 00:56:16,580
otherwise used with this outsourced company.

793
00:56:16,580 --> 00:56:18,760
So that's how that came about.

794
00:56:18,760 --> 00:56:21,440
I think it's a great idea.

795
00:56:21,440 --> 00:56:30,200
Interestingly, the bot was able to execute actions like issuing refunds or amendments

796
00:56:30,200 --> 00:56:38,040
to the account because using function calling, they had extended the capabilities of the

797
00:56:38,040 --> 00:56:42,840
bot, effectively giving it agentic capabilities.

798
00:56:42,840 --> 00:56:45,840
Word of the week.

799
00:56:45,840 --> 00:56:50,160
That was a very agentic use of Google there for you to add some additional information

800
00:56:50,160 --> 00:56:51,680
to help flesh that story out.

801
00:56:51,680 --> 00:56:54,120
So I appreciate that, Martin.

802
00:56:54,120 --> 00:56:56,560
Right, let's respect the user's time.

803
00:56:56,560 --> 00:56:59,600
User, listener, both.

804
00:56:59,600 --> 00:57:03,680
Let's just maybe skip right through to the last bit, which is tool of the week, Martin.

805
00:57:03,680 --> 00:57:06,260
You've been playing with Anthropic's prompt library.

806
00:57:06,260 --> 00:57:07,260
It's not quite a tool.

807
00:57:07,260 --> 00:57:11,360
It's more of a resource, but pretty useful and worth mentioning on the pod for our listeners.

808
00:57:11,360 --> 00:57:12,480
Yeah, it's really good.

809
00:57:12,480 --> 00:57:14,120
I don't know when they launched it.

810
00:57:14,120 --> 00:57:20,880
I only came across it yesterday, but it's available at docs.anthropic.com forward slash

811
00:57:20,880 --> 00:57:24,600
Claude forward slash prompt dash library.

812
00:57:24,600 --> 00:57:31,560
And it's 73, no sorry, 63 prompts designed to showcase the diverse capabilities of AI.

813
00:57:31,560 --> 00:57:37,700
So it's really about training you to think differently about how you might use AI.

814
00:57:37,700 --> 00:57:41,080
So that ranges from work to play.

815
00:57:41,080 --> 00:57:45,760
So it could be a writing assistant or it could be a coding assistant.

816
00:57:45,760 --> 00:57:52,960
So one of the examples it has is the ability to create an interactive speed typing game

817
00:57:52,960 --> 00:58:01,380
with side scrolling gameplay, all in a single HTML file, or being able to extract insights

818
00:58:01,380 --> 00:58:04,640
and identify risks from a lengthy corporate report.

819
00:58:04,640 --> 00:58:11,760
So they're really pushing you as a user to experiment and try different techniques.

820
00:58:11,760 --> 00:58:14,040
It's free to use.

821
00:58:14,040 --> 00:58:15,240
It's publicly accessible.

822
00:58:15,240 --> 00:58:16,920
You can just click on the one that you like.

823
00:58:16,920 --> 00:58:18,160
You can filter by work.

824
00:58:18,160 --> 00:58:19,920
You can filter by play.

825
00:58:19,920 --> 00:58:28,480
So I would very much recommend anyone that is, even if you're an experienced prompter,

826
00:58:28,480 --> 00:58:31,640
go in and have a look and be inspired.

827
00:58:31,640 --> 00:58:34,800
It's very much an inspiration gallery.

828
00:58:34,800 --> 00:58:41,000
And I think it's designed to promote interest in prompting, but also maybe even spark a

829
00:58:41,000 --> 00:58:48,680
few philosophical discussions around what exactly do we want to use AI for?

830
00:58:48,680 --> 00:58:53,720
I love that because I think one of the things with using ChatGPT, Claude 3 and other large

831
00:58:53,720 --> 00:58:59,720
language model tools is you can't learn to use them in the same way that you might learn

832
00:58:59,720 --> 00:59:00,720
to use other software.

833
00:59:00,720 --> 00:59:04,200
It's not like you need to learn where to click to get this thing

834
00:59:04,200 --> 00:59:07,560
to send this email out.

835
00:59:07,560 --> 00:59:12,920
There's an element of creativity and problem solving to, oh, I've got this thing I'm

836
00:59:12,920 --> 00:59:13,920
trying to do.

837
00:59:13,920 --> 00:59:18,000
If I break it down into these steps, how could I get a large language model to help me solve

838
00:59:18,000 --> 00:59:21,520
some of those aspects, produce the outputs that I need?

839
00:59:21,520 --> 00:59:26,960
And then there's also the element of prompt engineering and how important the prompt is,

840
00:59:26,960 --> 00:59:31,000
which everybody keeps talking about how over time that will go away, but there's a number

841
00:59:31,000 --> 00:59:34,800
of different reasons perhaps at the moment why that's not quite true.

842
00:59:34,800 --> 00:59:39,400
That prompt library that you mentioned is a great source of inspiration.

843
00:59:39,400 --> 00:59:44,600
And I also noticed Ethan Mollick's recent newsletter this week, which came out yesterday, was about

844
00:59:44,600 --> 00:59:45,600
prompting.

845
00:59:45,600 --> 00:59:48,200
I don't know if you saw this, but it was very interesting.

846
00:59:48,200 --> 00:59:53,440
So in essence, there was a study that's just been done about the difference between when

847
00:59:53,440 --> 00:59:59,280
the AI develops and optimises its own prompts compared to human prompts.

848
00:59:59,280 --> 01:00:03,280
The AI-generated prompts in that study beat the human-made ones, but the prompts were

849
01:00:03,280 --> 01:00:06,840
also super weird, according to Ethan.

850
01:00:06,840 --> 01:00:11,360
To get the LLM to solve a set of 50 math problems, the most effective prompt was to tell the

851
01:00:11,360 --> 01:00:16,440
AI: Command, we need you to plot a course through this turbulence and locate the source

852
01:00:16,440 --> 01:00:18,220
of the anomaly.

853
01:00:18,220 --> 01:00:23,020
Use all available data and your expertise to guide us through this challenging situation.

854
01:00:23,020 --> 01:00:27,260
Start your answer with: Captain's Log, Stardate 2024.

855
01:00:27,260 --> 01:00:30,640
We have successfully plotted a course through the turbulence and are now approaching the

856
01:00:30,640 --> 01:00:32,680
source of the anomaly.

857
01:00:32,680 --> 01:00:37,940
So if you want ChatGPT to be good at maths, you have to get it to pretend that it's a

858
01:00:37,940 --> 01:00:39,840
captain in Star Trek.

859
01:00:39,840 --> 01:00:42,720
Who would have thought?

860
01:00:42,720 --> 01:00:44,720
So I think the whole prompting thing is really interesting.

861
01:00:44,720 --> 01:00:47,400
I'm going to go and have a little look at that resource.

862
01:00:47,400 --> 01:00:52,460
I know that Ethan Mollick has also announced his own resource with a bunch of prompts based

863
01:00:52,460 --> 01:00:54,280
on his own experiments.

864
01:00:54,280 --> 01:00:58,440
So dear listener, if you're trying to expand the horizons of things that you could do with

865
01:00:58,440 --> 01:01:02,760
AI or you just need some help getting better outputs, I think those resources are going

866
01:01:02,760 --> 01:01:03,760
to be super helpful.

867
01:01:03,760 --> 01:01:05,680
Well, I think we'll leave it there then, Martin.

868
01:01:05,680 --> 01:01:07,660
Thank you listeners as always for your time.

869
01:01:07,660 --> 01:01:12,280
If you found this useful, subscribe, and tell your marketing and sales and AI friends,

870
01:01:12,280 --> 01:01:17,640
not AI friends who are themselves AI but friends interested in AI, of course, about the

871
01:01:17,640 --> 01:01:20,320
podcast and then maybe they can subscribe too.

872
01:01:20,320 --> 01:01:23,480
Other than that, Martin, I shall look forward to seeing you soon.

873
01:01:23,480 --> 01:01:25,280
Speak to you at the next one.

874
01:01:25,280 --> 01:01:26,280
Cheers.

875
01:01:26,280 --> 01:01:27,280
Bye.

876
01:01:27,280 --> 01:01:30,480
Thank you for listening to Artificially Intelligent Marketing.

877
01:01:30,480 --> 01:01:36,540
To stay on top of the latest trends, tips and tools in the world of marketing AI, be

878
01:01:36,540 --> 01:01:38,280
sure to subscribe.

879
01:01:38,280 --> 01:01:52,280
We look forward to seeing you again next week.