1
00:00:00,000 --> 00:00:09,960
Welcome to Artificially Intelligent Marketing, a weekly podcast where we stay on top of the

2
00:00:09,960 --> 00:00:15,700
latest trends, tips and tools in the world of marketing AI, helping you get the best

3
00:00:15,700 --> 00:00:18,540
results from your marketing efforts.

4
00:00:18,540 --> 00:00:23,040
Now let's join our hosts, Paul Avery and Martin Broadhurst.

5
00:00:23,040 --> 00:00:28,700
Welcome to episode 41 of Artificially Intelligent Marketing with me, Paul Avery and my lovely

6
00:00:28,700 --> 00:00:30,320
co-host Martin Broadhurst.

7
00:00:30,320 --> 00:00:31,600
How are you today, Martin?

8
00:00:31,600 --> 00:00:37,000
I am on top form and I'm just trying to stay on top of it.

9
00:00:37,000 --> 00:00:43,960
By it, I mean just this deluge of AI updates, which is great for us because we get to record

10
00:00:43,960 --> 00:00:50,440
yet another episode of the podcast, but it can be a little bit much.

11
00:00:50,440 --> 00:00:56,280
We broke our phones last night bashing WhatsApp and for listeners who have been with us for

12
00:00:56,280 --> 00:01:01,360
a while, you'll know that we tend to record about every two weeks these days, but we had

13
00:01:01,360 --> 00:01:06,520
an episode go out on Monday, it's Friday today, and we had to do another episode even though

14
00:01:06,520 --> 00:01:12,240
our last episode only came out like four days ago because this week has been crazy.

15
00:01:12,240 --> 00:01:17,000
Yesterday was super crazy and there's just loads of really important stuff to get through

16
00:01:17,000 --> 00:01:20,400
and so a little bit of a bonus episode in some ways.

17
00:01:20,400 --> 00:01:21,400
What does that mean?

18
00:01:21,400 --> 00:01:24,160
It means that this is coming to you raw, listeners.

19
00:01:24,160 --> 00:01:30,840
Martin and I have not had our usual debrief and sort of bashing back and forth of figuring

20
00:01:30,840 --> 00:01:32,720
out what different stuff means.

21
00:01:32,720 --> 00:01:36,040
We haven't written many scripting elements like we normally would.

22
00:01:36,040 --> 00:01:41,160
It's just going to be a good old dynamic conversation as we try and make sense of the massive news

23
00:01:41,160 --> 00:01:42,440
this week.

24
00:01:42,440 --> 00:01:44,200
What do we mean by massive news?

25
00:01:44,200 --> 00:01:51,600
We mean that Google released Gemini 1.5 Pro, so the next version of their Gemini model,

26
00:01:51,600 --> 00:01:55,880
not the Ultra version, the Pro version, but now the Pro version is better than the Ultra

27
00:01:55,880 --> 00:01:56,880
version.

28
00:01:56,880 --> 00:01:59,960
We'll get into details on that later, but this is pretty impressive considering they

29
00:01:59,960 --> 00:02:03,800
only started telling us about Gemini in December.

30
00:02:03,800 --> 00:02:08,980
Then not to be outdone, OpenAI announced Sora, which is this crazy video generation model

31
00:02:08,980 --> 00:02:13,600
that somehow understands world physics and we'll get into what that means.

32
00:02:13,600 --> 00:02:18,680
Those are the two sort of big massive bits of information, but mixed in with all of that,

33
00:02:18,680 --> 00:02:23,160
which probably would have been big news in any other week, Stability AI released a new

34
00:02:23,160 --> 00:02:30,320
way of chaining different models together to produce images, which has a load of benefits

35
00:02:30,320 --> 00:02:34,040
and potentially new applications, which is really interesting.

36
00:02:34,040 --> 00:02:39,960
Amazon released the largest text-to-speech model yet that shows some emerging capabilities

37
00:02:39,960 --> 00:02:41,640
and we'll look into those.

38
00:02:41,640 --> 00:02:46,480
And just because that's clearly not enough news, there's rumours circulating that OpenAI

39
00:02:46,480 --> 00:02:51,760
are moving into the world of search and trying to come up with a product that would rival

40
00:02:51,760 --> 00:02:56,600
Google search, perhaps a little bit like what Perplexity does, but we just don't know.

41
00:02:56,600 --> 00:03:00,200
So that's what we've got to try and get into today, Martin.

42
00:03:00,200 --> 00:03:01,520
Crazy.

43
00:03:01,520 --> 00:03:03,120
Just a couple of stories there.

44
00:03:03,120 --> 00:03:05,160
Nothing that major, right?

45
00:03:05,160 --> 00:03:06,160
Right.

46
00:03:06,160 --> 00:03:12,080
Let's go first and let's talk about Gemini 1.5 Pro.

47
00:03:12,080 --> 00:03:17,120
So yesterday, this is Friday in the UK, it's 1.25pm.

48
00:03:17,120 --> 00:03:22,480
Last night, Thursday night in the UK, well, Thursday afternoon-ish, Google announced Gemini

49
00:03:22,480 --> 00:03:24,240
1.5 Pro.

50
00:03:24,240 --> 00:03:29,720
So in essence, it's the next generation of their Gemini models, but in benchmarks, it

51
00:03:29,720 --> 00:03:35,600
compares really well with Gemini Ultra, which was their super duper model that they just

52
00:03:35,600 --> 00:03:39,080
released like, I don't know, a week ago.

53
00:03:39,080 --> 00:03:42,760
It uses the new mixture of experts model architecture that we're seeing.

54
00:03:42,760 --> 00:03:49,960
This is what Mistral's models use as well to improve training and make it easier to

55
00:03:49,960 --> 00:03:53,200
offer the service because you don't run the whole model for every query.

56
00:03:53,200 --> 00:03:56,460
You use this mixture of experts model.

57
00:03:56,460 --> 00:04:02,520
Perhaps most importantly, it has a breakthrough context window of 1 million tokens, which

58
00:04:02,520 --> 00:04:08,640
for context, the biggest up until now was Claude, which had 200,000 tokens.

59
00:04:08,640 --> 00:04:14,520
And we'll go into a bit more detail on what token and context length mean in this context,

60
00:04:14,520 --> 00:04:19,520
but that's a massive jump of five times against the current industry best.

61
00:04:19,520 --> 00:04:26,760
And in research applications, it had a 10 million token context, which is mind blowing.

62
00:04:26,760 --> 00:04:31,320
And we'll talk about what that means because in essence, you know, now it can process vast

63
00:04:31,320 --> 00:04:35,240
amounts of text, video, audio, images, code all in one go.

64
00:04:35,240 --> 00:04:39,480
It is obviously outperforming the original Gemini Pro.

65
00:04:39,480 --> 00:04:44,240
And we should expect to see this starting to be released soon because they have already

66
00:04:44,240 --> 00:04:49,200
allowed developers and enterprise customers into preview mode to be able to actually play

67
00:04:49,200 --> 00:04:57,280
with Gemini 1.5 Pro, mostly using a 128,000 token version, but the 1 million token version

68
00:04:57,280 --> 00:04:58,880
will come later.

69
00:04:58,880 --> 00:05:02,600
So there's all like this sort of bits and pieces you need to know about this.

70
00:05:02,600 --> 00:05:05,120
Martin, tell us what does this mean?

71
00:05:05,120 --> 00:05:10,080
Well, let's start with that massive context window.

72
00:05:10,080 --> 00:05:12,200
What does that enable us to do?

73
00:05:12,200 --> 00:05:19,040
Well, we can think of tokens as bits of words, or they are actually, they're not bits of

74
00:05:19,040 --> 00:05:21,880
words, that's only when it's the text based modality.

75
00:05:21,880 --> 00:05:32,520
But a token is the mathematical reference for any bit of data that the language model

76
00:05:32,520 --> 00:05:33,520
understands.

77
00:05:33,520 --> 00:05:38,200
If you get a word, that word will be chopped up into tokens.

78
00:05:38,200 --> 00:05:44,920
Typically, like 100 tokens will be about 75 words.

79
00:05:44,920 --> 00:05:50,680
But it's not just, as we're seeing with this multimodal version, it's not just words that

80
00:05:50,680 --> 00:05:57,360
get turned into tokens, it's images, audio, video, and the rest of it.

81
00:05:57,360 --> 00:06:03,000
So having a larger context window basically says that the model can handle that much more

82
00:06:03,000 --> 00:06:12,400
information for you to interrogate, converse, provide, go back and forth with, etc.

83
00:06:12,400 --> 00:06:20,440
Yeah, it's 700,000 words of text, 30,000 lines of code, 11 hours of audio or one hour of

84
00:06:20,440 --> 00:06:23,920
video, and that's for the 1 million context.

85
00:06:23,920 --> 00:06:28,840
So you can times all those numbers by 10 for the 10 million context window they were able

86
00:06:28,840 --> 00:06:31,160
to achieve in sort of research applications.

87
00:06:31,160 --> 00:06:33,560
So a lot of stuff.

88
00:06:33,560 --> 00:06:41,200
Yeah, and to, you know, that many words can sound like, well, what does that equate to?

89
00:06:41,200 --> 00:06:51,320
When Anthropic announced the 100,000 token context window, which was about a year ago,

90
00:06:51,320 --> 00:06:57,440
and I find this mad because we went from, I think it was about 8,000 tokens, it suddenly

91
00:06:57,440 --> 00:06:59,120
jumped up to 100,000.

92
00:06:59,120 --> 00:07:02,200
Everybody kind of lost their mind going, oh my God, there's so much that you can do with

93
00:07:02,200 --> 00:07:03,200
that.

94
00:07:03,200 --> 00:07:08,120
But when they announced that, that was described as like a novel.

95
00:07:08,120 --> 00:07:15,880
And now we're talking about 750,000 words, 10 novels.

96
00:07:15,880 --> 00:07:18,200
And that's, yeah, it's crazy.

97
00:07:18,200 --> 00:07:24,740
The more interesting thing is what they've announced it can do with that though.

98
00:07:24,740 --> 00:07:29,220
There are some people that have been doing benchmark tests on ability to recall.

99
00:07:29,220 --> 00:07:33,600
So if you, a big context window means you can throw in a book and then can it actually

100
00:07:33,600 --> 00:07:36,040
extract information from that?

101
00:07:36,040 --> 00:07:40,840
Otherwise, if it can't accurately tell you what was in the book and it's hallucinating

102
00:07:40,840 --> 00:07:45,000
things then, you know, really what's the point if you can't rely on it?

103
00:07:45,000 --> 00:07:50,800
What they found is that its recallability, its needle in a haystack test, which is basically

104
00:07:50,800 --> 00:07:59,360
where you plant an obscure bit of text or ask it to find one bit of detail about the

105
00:07:59,360 --> 00:08:01,160
information that you've provided it.

106
00:08:01,160 --> 00:08:04,520
You know, like what colour was somebody's shoes or whatever.

107
00:08:04,520 --> 00:08:12,800
It's something kind of a bit, not kind of integral to the body of the work itself.

108
00:08:12,800 --> 00:08:24,960
It has 99.7% recall on a million tokens, which is far and away better than any other model

109
00:08:24,960 --> 00:08:27,320
that we're seeing on the market at the moment.

110
00:08:27,320 --> 00:08:33,180
Yeah, it's really interesting because having crawled through the research paper and tried

111
00:08:33,180 --> 00:08:40,440
to get my head around, you know, what this means for marketers, that is a seriously impressive

112
00:08:40,440 --> 00:08:48,120
recall rate to the tune of you could basically give it a whole corpus of information about

113
00:08:48,120 --> 00:08:53,440
your business, say for a junior employee to come in and actually have a chat bot that

114
00:08:53,440 --> 00:08:57,800
they can speak to about all your systems, documents, processes, sales decks, whatever.

115
00:08:57,800 --> 00:09:03,920
If that was somehow encoded in that and Gemini Pro is obviously multimodal.

116
00:09:03,920 --> 00:09:10,340
So it could take recordings of meetings or screen share videos of meetings, pictures

117
00:09:10,340 --> 00:09:15,280
of whiteboards and basically answer questions by keeping all of that information in its

118
00:09:15,280 --> 00:09:22,240
context window versus using something like retrieval augmented generation, which at least

119
00:09:22,240 --> 00:09:27,640
to untrained folks like Martin and I sounds like it doesn't work that well.

120
00:09:27,640 --> 00:09:32,280
So this being able to import all this information in the context window is really powerful.

121
00:09:32,280 --> 00:09:37,960
I will say, Martin, I went digging: the word hallucination is mentioned once in the entire

122
00:09:37,960 --> 00:09:43,800
research paper in reference to Claude, not in reference to itself.

123
00:09:43,800 --> 00:09:48,040
Not because it's saying that Claude hallucinates a lot, but actually it's saying Claude doesn't

124
00:09:48,040 --> 00:09:51,280
because Claude will just refuse to answer rather than hallucinate, which is one of my

125
00:09:51,280 --> 00:09:55,480
favorite things about Claude and still my model of choice for summarizing transcripts

126
00:09:55,480 --> 00:09:58,680
of calls and what have you.

127
00:09:58,680 --> 00:10:02,960
And there's a separate quote, I think it might be from Sundar Pichai, but it might be Demis

128
00:10:02,960 --> 00:10:09,200
Hassabis, not sure, someone at Google saying the model still hallucinates.

129
00:10:09,200 --> 00:10:11,120
And so it's interesting.

130
00:10:11,120 --> 00:10:17,360
It has a great retrieval score, but it's still going to make stuff up every now and again,

131
00:10:17,360 --> 00:10:22,480
which I wonder Martin, for all of this power, is that going to end up still making it somewhat

132
00:10:22,480 --> 00:10:23,920
limited for some applications?

133
00:10:23,920 --> 00:10:27,200
We've talked about the issues of hallucination in things like chat bots, customer facing

134
00:10:27,200 --> 00:10:28,760
chat bots, for example.

135
00:10:28,760 --> 00:10:36,240
Well, I think what you end up doing when it goes to be deployed in production in any environment

136
00:10:36,240 --> 00:10:40,640
is actually you're not just deploying a raw interface with the model.

137
00:10:40,640 --> 00:10:47,960
You're having to have layers where there is maybe a generation, you have your initial

138
00:10:47,960 --> 00:10:51,400
inference where you have your prompt, but then that's run through a series of checks

139
00:10:51,400 --> 00:10:54,680
before it presents the final output to the end user.

140
00:10:54,680 --> 00:10:58,960
And these checks are basically saying, is any of this right or wrong?

141
00:10:58,960 --> 00:11:06,200
The Perplexity API at the moment has a kind of fact checking element built into it.

142
00:11:06,200 --> 00:11:10,360
So if you were doing, let's say you wanted a chat bot on your website that was also connected

143
00:11:10,360 --> 00:11:12,920
to the web, you could do that with the Perplexity API.

144
00:11:12,920 --> 00:11:18,520
And I suspect that's the kind of thing that we're going to start seeing is you don't just

145
00:11:18,520 --> 00:11:20,500
necessarily interact with the raw model.

146
00:11:20,500 --> 00:11:25,080
There's a layer of steps before the output is presented.

147
00:11:25,080 --> 00:11:29,080
And Gemini already, the chat bot version of Gemini, already has that little Google button that you

148
00:11:29,080 --> 00:11:32,320
can click on that will fact-check things that have been output.

149
00:11:32,320 --> 00:11:37,120
So I think that's going to be interesting to see how that plays out in real world applications

150
00:11:37,120 --> 00:11:43,440
as to whether hallucinating about something random you've asked the model, like ChatGPT,

151
00:11:43,440 --> 00:11:49,760
and you say, what are the 10 most important foods that dolphins eat and it makes one of

152
00:11:49,760 --> 00:11:51,720
them up, right?

153
00:11:51,720 --> 00:11:56,060
Versus asking it something specific about a corpus of information that you've given

154
00:11:56,060 --> 00:12:02,560
it and whether that retrieval of 99.7% basically to all intents and purposes means it doesn't

155
00:12:02,560 --> 00:12:07,680
really make any meaningful mistakes on information in the context window.

156
00:12:07,680 --> 00:12:10,920
Because then you end up getting two slightly different use cases, right?

157
00:12:10,920 --> 00:12:17,320
Like if you feed it the context and it can be extremely accurate about the context information

158
00:12:17,320 --> 00:12:24,240
it's been given, then if, and if indeed, this 10 million token window becomes a commercial

159
00:12:24,240 --> 00:12:28,160
reality which given the rate of development of all these tools, it almost certainly will

160
00:12:28,160 --> 00:12:32,640
be, that's a really interesting way of solving the hallucination problem.

161
00:12:32,640 --> 00:12:36,320
I'd be in fact, I hope someone who listens to this, maybe we've got some machine learning

162
00:12:36,320 --> 00:12:41,120
experts, will message us on the LinkedIns or the Twitters and let us know what they

163
00:12:41,120 --> 00:12:47,640
think about that because I'd be interested to see if that opens up or solidifies some

164
00:12:47,640 --> 00:12:52,040
business applications that until now hallucinations have got in the way of.

165
00:12:52,040 --> 00:12:58,320
Hallucinations are such a funny one because they are inherent in what the tool does, right?

166
00:12:58,320 --> 00:13:04,440
All the tool is just in fact a great way for people to kind of experiment with these models

167
00:13:04,440 --> 00:13:11,840
and see what's going on underneath the hood is to go to something like playground.openai.com

168
00:13:11,840 --> 00:13:16,840
and start actually interacting with the raw model, not the filtered version through

169
00:13:16,840 --> 00:13:17,840
ChatGPT.

170
00:13:17,840 --> 00:13:20,480
Because when you start interacting with the raw model and you start playing around with

171
00:13:20,480 --> 00:13:27,120
things like the temperature setting, you'll see like how hallucinations come about.

172
00:13:27,120 --> 00:13:34,600
So if you put the temperature setting really low, hallucinations are very limited, but also

173
00:13:34,600 --> 00:13:39,640
the range of things that it will talk to you about, its creativity, is limited.

174
00:13:39,640 --> 00:13:45,560
It's just going to be really literal and not.

175
00:13:45,560 --> 00:13:49,280
It's going to be very literal and very precise, should I say?

176
00:13:49,280 --> 00:13:50,280
Right.

177
00:13:50,280 --> 00:13:54,480
When you put the temperature up to, in fact, absolutely it's completely absurd and I don't

178
00:13:54,480 --> 00:13:56,340
know why they allow you to do this.

179
00:13:56,340 --> 00:14:00,640
If you change the temperature setting to two, which is the highest setting it can be, and

180
00:14:00,640 --> 00:14:03,240
then run the prompt, it's gibberish.

181
00:14:03,240 --> 00:14:06,720
It's pure gibberish, not even like a couple of words.

182
00:14:06,720 --> 00:14:09,000
It is just like random characters.

183
00:14:09,000 --> 00:14:10,920
It's bits of code.

184
00:14:10,920 --> 00:14:12,680
It's absolutely wild.

185
00:14:12,680 --> 00:14:18,180
So you say to it, write me a strapline for an ice cream shop, temperature two, it's just

186
00:14:18,180 --> 00:14:19,180
gobbledygook.

187
00:14:19,180 --> 00:14:20,180
Well, that's your problem.

188
00:14:20,180 --> 00:14:23,180
Your ice cream melted.

189
00:14:23,180 --> 00:14:29,000
I think that's cool because ultimately I think they're trying to show the range of influence

190
00:14:29,000 --> 00:14:33,880
that the temperature setting can have, which is if it's high, model performance basically

191
00:14:33,880 --> 00:14:35,440
falls apart completely.

192
00:14:35,440 --> 00:14:36,440
Absolutely.

193
00:14:36,440 --> 00:14:37,440
Right.

194
00:14:37,440 --> 00:14:38,440
Yeah.

195
00:14:38,440 --> 00:14:39,440
It is interesting.

196
00:14:39,440 --> 00:14:43,760
Let's talk a little bit about some of the examples, Martin, that they listed out on

197
00:14:43,760 --> 00:14:52,200
the Gemini 1.5 Pro launch blog post because we've talked about business applications and

198
00:14:52,200 --> 00:14:54,440
will hallucinations still be a problem?

199
00:14:54,440 --> 00:14:58,880
But crumbs, some of those examples are pretty exciting, aren't they?

200
00:14:58,880 --> 00:14:59,880
Yeah.

201
00:14:59,880 --> 00:15:07,920
So the one that caught both of our attention was the Buster Keaton video, 44 minute silent

202
00:15:07,920 --> 00:15:12,720
movie uploaded and then interrogated with the model.

203
00:15:12,720 --> 00:15:19,520
So asked to describe the plot, asked to identify certain bits of information from the plot

204
00:15:19,520 --> 00:15:21,920
or from the video.

205
00:15:21,920 --> 00:15:27,680
For example, there's a particular scene where somebody takes like a receipt or a ticket

206
00:15:27,680 --> 00:15:31,800
out of their pocket and then that's shown to camera and it's got some detail on it about

207
00:15:31,800 --> 00:15:37,200
like the business that was working and the product that was offered kind of thing.

208
00:15:37,200 --> 00:15:40,560
And they asked the model to explain and answer what was on that.

209
00:15:40,560 --> 00:15:45,720
And it does that accurately and it gives them a timestamp of the point where it happens in

210
00:15:45,720 --> 00:15:49,480
the video, which is kind of mind blowing.

211
00:15:49,480 --> 00:15:50,960
Well, you can just give it a video.

212
00:15:50,960 --> 00:15:54,080
Bear in mind, it's not trained to do this at all.

213
00:15:54,080 --> 00:15:58,600
It isn't trained on how to search and interrogate it.

214
00:15:58,600 --> 00:16:00,360
It's just next token prediction.

215
00:16:00,360 --> 00:16:03,360
I find that absolutely bonkers.

216
00:16:03,360 --> 00:16:11,200
The other thing that it can do was the multimodality element of it allows them to do a very loose

217
00:16:11,200 --> 00:16:19,680
pencil sketch, just a line drawing, put that in as a prompt and say, give me the timestamp

218
00:16:19,680 --> 00:16:21,880
for where this happens.

219
00:16:21,880 --> 00:16:27,480
And it accurately works out what the rough sketch is, and it is a very rough sketch.

220
00:16:27,480 --> 00:16:29,800
It's the roughest of rough sketches.

221
00:16:29,800 --> 00:16:33,200
And it says, yeah, this is the bit where you want to go.

222
00:16:33,200 --> 00:16:39,800
So it understands what's going on in a silent movie that's 44 minutes long.

223
00:16:39,800 --> 00:16:45,040
So, you know, this is no, like, five second animated GIF.

224
00:16:45,040 --> 00:16:50,000
This is a lot of frames and it understands the context, the plot.

225
00:16:50,000 --> 00:16:51,920
It can read the text in it.

226
00:16:51,920 --> 00:17:00,320
It can, well, I would love to see and push the limits of what that can do because again,

227
00:17:00,320 --> 00:17:09,120
applications for that are quite vast in the sense that, well, you've suddenly opened up

228
00:17:09,120 --> 00:17:14,320
new ways to interrogate company data.

229
00:17:14,320 --> 00:17:18,240
So training videos, what could you do with that?

230
00:17:18,240 --> 00:17:23,160
How can you get people if you've got manuals?

231
00:17:23,160 --> 00:17:28,480
So I'm just thinking about if you've got a product and you've got some online videos

232
00:17:28,480 --> 00:17:33,200
with maintenance, the idea that then you can connect your chat bot to a front end customer

233
00:17:33,200 --> 00:17:38,160
service tool where customers can say, how do I do this thing?

234
00:17:38,160 --> 00:17:43,440
And it can locate the exact point in a library of videos and say, this is the scene that

235
00:17:43,440 --> 00:17:45,680
you're looking for.

236
00:17:45,680 --> 00:17:46,680
That's amazing.

237
00:17:46,680 --> 00:17:53,880
If you're a washing machine manufacturer or a company that manufactures consumer goods and you can just

238
00:17:53,880 --> 00:17:59,280
pinpoint people to the exact spot based on people's natural language questions or people

239
00:17:59,280 --> 00:18:04,000
taking a photo of the product going, how do I fix this bit?

240
00:18:04,000 --> 00:18:11,000
Or whatever, that opens up all sorts of avenues for customer support and engagement.

241
00:18:11,000 --> 00:18:15,960
Yes, I think I love the way all these convergent technologies start to influence each other

242
00:18:15,960 --> 00:18:24,400
because when Meta released the Oculus Quest 2 and Quest and VR headsets started to become

243
00:18:24,400 --> 00:18:27,800
at least a little bit more mainstream and maybe they will or won't with the Vision

244
00:18:27,800 --> 00:18:34,960
Pro, there was sort of, futurists would speculate on, oh, where's this all going?

245
00:18:34,960 --> 00:18:40,660
And people would imagine that we're all walking around in glasses that have an overlay.

246
00:18:40,660 --> 00:18:46,440
We see the world, but they've got a digital overlay on it and the things in our field

247
00:18:46,440 --> 00:18:49,720
of view would like pop up and tell us stuff about the world.

248
00:18:49,720 --> 00:18:54,800
Like, oh, you're running out of breakfast cereal or this is the thing that's broken

249
00:18:54,800 --> 00:18:56,320
in your washing machine, right?

250
00:18:56,320 --> 00:18:57,320
To your point.

251
00:18:57,320 --> 00:19:05,960
But a lot of the speculation at that time was about how do we miniaturize down the technology

252
00:19:05,960 --> 00:19:09,700
of a big chunky headset, which of course we haven't achieved yet, although Meta's Ray-Ban

253
00:19:09,700 --> 00:19:13,840
glasses are kind of interesting and have proven quite popular because they're just a set of

254
00:19:13,840 --> 00:19:16,280
glasses with cameras and a mic in.

255
00:19:16,280 --> 00:19:20,160
So the AI can see what you see and then speak to you, right?

256
00:19:20,160 --> 00:19:24,200
It can't overlay anything in your field of view, but maybe that would still be valuable,

257
00:19:24,200 --> 00:19:25,200
right?

258
00:19:25,200 --> 00:19:30,040
And I think this is the point, which is at the time nobody seemed to realize that, or

259
00:19:30,040 --> 00:19:33,560
I'm sure lots of people realized, but no one seemed from what I could see to be talking

260
00:19:33,560 --> 00:19:40,520
about the issues that come with a computer actually understanding what you're looking

261
00:19:40,520 --> 00:19:44,720
at and what these capabilities now open up.

262
00:19:44,720 --> 00:19:49,880
When you look at the converging technologies of VR, cameras and sensors everywhere, but

263
00:19:49,880 --> 00:19:55,960
actually AI's ability to understand what it's seeing, what it's hearing, what it's reading,

264
00:19:55,960 --> 00:20:00,200
that's what just opens up a huge plethora of capabilities, the likes of which I don't

265
00:20:00,200 --> 00:20:07,520
even think we can all imagine yet, if I'm really honest, but it will be driven by convergent

266
00:20:07,520 --> 00:20:10,120
technology evolution.

267
00:20:10,120 --> 00:20:12,800
You wouldn't be able to do it without improvements in VR.

268
00:20:12,800 --> 00:20:15,100
You wouldn't be able to do it without improvements in AI.

269
00:20:15,100 --> 00:20:19,120
You wouldn't be able to do it without reductions in cost for manufacturing of all of these

270
00:20:19,120 --> 00:20:22,880
things, but all those things together, you won't be able to do it without the internet.

271
00:20:22,880 --> 00:20:26,480
The fact that we even have large language models and that we all as a human species

272
00:20:26,480 --> 00:20:31,040
went diving down this rabbit hole is because we threw a corpus of information into a single

273
00:20:31,040 --> 00:20:33,600
place that even made it possible to train the models.

274
00:20:33,600 --> 00:20:38,080
We were talking offline about YouTube and how much data there is in YouTube and how

275
00:20:38,080 --> 00:20:44,160
we as the human race have just spent the last 10, 15 years just producing all the data that's

276
00:20:44,160 --> 00:20:46,480
even going to enable all of this stuff.

277
00:20:46,480 --> 00:20:49,760
And I'm sure smart people at the time who were collecting it all probably predicted

278
00:20:49,760 --> 00:20:53,300
it, but certainly I had no idea that this is where it would lead us.

279
00:20:53,300 --> 00:20:56,680
And it's just mind blowing.

280
00:20:56,680 --> 00:20:58,840
It's all very exciting.

281
00:20:58,840 --> 00:21:03,840
Until we get our hands on it though, let's reserve judgment as is always the way with

282
00:21:03,840 --> 00:21:04,840
these things.

283
00:21:04,840 --> 00:21:07,240
I have joined the wait list.

284
00:21:07,240 --> 00:21:09,640
So that's quite exciting.

285
00:21:09,640 --> 00:21:15,280
If you are in the UK and you want to join the wait list, you do have to sign up with

286
00:21:15,280 --> 00:21:20,160
a VPN to the AI studio on Google.

287
00:21:20,160 --> 00:21:25,360
If you try and do it with a UK IP, it says, no, sorry.

288
00:21:25,360 --> 00:21:29,600
I think the earliest I'm going to get to play with it is in Magai, which is a tool we've

289
00:21:29,600 --> 00:21:34,960
talked about on the podcast before.

290
00:21:34,960 --> 00:21:40,680
I'm doing it a disservice to say it's a skin of models of ChatGPT that basically allows

291
00:21:40,680 --> 00:21:41,680
you to interact with different models.

292
00:21:41,680 --> 00:21:45,040
It's got a hell of a lot more going on inside it these days than that.

293
00:21:45,040 --> 00:21:49,760
But what I really love about what the developer Dustin and the team are doing is they add

294
00:21:49,760 --> 00:21:50,840
new models all the time.

295
00:21:50,840 --> 00:21:56,160
So I can access Gemini Pro 1.0 through that system.

296
00:21:56,160 --> 00:21:59,360
And he's already promised on the Facebook group that the minute that he can roll that

297
00:21:59,360 --> 00:22:03,280
system out in Magai with 1.5, we'll get to play with it.

298
00:22:03,280 --> 00:22:06,160
So I suspect that would be my first chance.

299
00:22:06,160 --> 00:22:13,360
Talking about things that we want to play with, let's talk about Sora and OpenAI, the

300
00:22:13,360 --> 00:22:14,360
trolls.

301
00:22:14,360 --> 00:22:16,840
Gemini 1.5 Pro drops.

302
00:22:16,840 --> 00:22:20,160
Three hours later, OpenAI goes, oh, you think that's good?

303
00:22:20,160 --> 00:22:21,960
Check out this.

304
00:22:21,960 --> 00:22:25,200
Tell the listeners about Sora.

305
00:22:25,200 --> 00:22:28,240
Talk about raining on someone's parade, hey?

306
00:22:28,240 --> 00:22:32,780
Big announcement from Google and then they come out with, hey, look at all of these one

307
00:22:32,780 --> 00:22:39,560
minute long HD quality videos produced using only text inputs.

308
00:22:39,560 --> 00:22:41,120
Aren't they amazing?

309
00:22:41,120 --> 00:22:45,200
And the internet went wild.

310
00:22:45,200 --> 00:22:51,320
Who cared for Gemini Pro 1.5 within five minutes of this announcement?

311
00:22:51,320 --> 00:22:52,320
It was crazy.

312
00:22:52,320 --> 00:22:53,640
My timeline was absolutely full of it.

313
00:22:53,640 --> 00:23:04,000
So yeah, this is OpenAI's text to video model, Sora, that creates realistic videos from text

314
00:23:04,000 --> 00:23:06,420
based instructions.

315
00:23:06,420 --> 00:23:10,360
As I mentioned, it's up to one minute in length.

316
00:23:10,360 --> 00:23:12,400
They are really high quality.

317
00:23:12,400 --> 00:23:18,680
They're HD videos and one of the fascinating things about them is that they have cut scenes

318
00:23:18,680 --> 00:23:19,680
in them.

319
00:23:19,680 --> 00:23:23,480
So it's not just one shot.

320
00:23:23,480 --> 00:23:31,120
It will interchange like you would see in an advert or in a film or whatever.

321
00:23:31,120 --> 00:23:38,020
When I saw that, that was the thing, bar the quality, and the quality is very good.

322
00:23:38,020 --> 00:23:41,160
Seeing that blew my mind slightly.

323
00:23:41,160 --> 00:23:43,040
That was really impressive.

324
00:23:43,040 --> 00:23:46,880
A couple of other things to note about this model.

325
00:23:46,880 --> 00:23:54,840
It understands language incredibly well, enabling very accurate interpretations of the written

326
00:23:54,840 --> 00:23:56,640
prompts.

327
00:23:56,640 --> 00:24:00,800
It understands physics very well and the physical world.

328
00:24:00,800 --> 00:24:07,180
So in one of the videos, there is a stylish Japanese woman walking down a street and there's

329
00:24:07,180 --> 00:24:12,960
lots of puddles on the floor with various kind of city lighting and urban lighting in the

330
00:24:12,960 --> 00:24:19,680
background and you can see reflections in the puddles and in the windows and you'll see

331
00:24:19,680 --> 00:24:26,360
the reflections of other pedestrians within those puddles.

332
00:24:26,360 --> 00:24:31,960
So it kind of understands all of the light and the environment as well.

333
00:24:31,960 --> 00:24:34,960
It does struggle with some more complex physics.

334
00:24:34,960 --> 00:24:42,520
In fact, I think in the blog post announcing it, they talk about this and complex physics

335
00:24:42,520 --> 00:24:50,240
such as you might see the video where somebody takes a bite out of a biscuit and then when

336
00:24:50,240 --> 00:24:55,040
they remove the biscuit from their mouth, the biscuit is still a whole biscuit.

337
00:24:55,040 --> 00:24:57,380
Not got any of it.

338
00:24:57,380 --> 00:25:03,080
And I think we can all agree that biscuit physics is the next frontier of science to

339
00:25:03,080 --> 00:25:04,080
crack.

340
00:25:04,080 --> 00:25:07,320
If they can solve that, it'd be pretty sweet is all I'm going to say.

341
00:25:07,320 --> 00:25:10,080
But there is a video on the...

342
00:25:10,080 --> 00:25:14,600
So if you want to read more about this and to be honest, we're going to do our best to

343
00:25:14,600 --> 00:25:17,560
explain how awesome these videos are.

344
00:25:17,560 --> 00:25:20,240
But if you really want to know what we're talking about, you need to go read the blog

345
00:25:20,240 --> 00:25:26,000
post on the OpenAI site, which also has a link to a longer article that goes into a

346
00:25:26,000 --> 00:25:29,000
bit more detail about how the model works and what it's capable of.

347
00:25:29,000 --> 00:25:36,400
And in the second one, the more detailed one, there is a video where someone bites a burger

348
00:25:36,400 --> 00:25:40,560
and the burger comes out of the mouth, bitten.

349
00:25:40,560 --> 00:25:44,520
So there are clearly weaknesses in the model, but there are some incredible things that

350
00:25:44,520 --> 00:25:46,640
it can definitely do.

351
00:25:46,640 --> 00:25:50,720
It's been built using the diffusion models and the transformer architecture.

352
00:25:50,720 --> 00:25:56,120
So a diffusion model is what the DALL-E 3 model is built on and transformers is what's really

353
00:25:56,120 --> 00:25:57,400
underpinning everything.

354
00:25:57,400 --> 00:26:00,360
In GPT, the T stands for transformer, right?

355
00:26:00,360 --> 00:26:06,920
So this is the underlying technology that they've gone all in on.

356
00:26:06,920 --> 00:26:14,560
And there were some interesting notes in there about what this means for where AI is heading

357
00:26:14,560 --> 00:26:18,480
in terms of understanding the physical world.

358
00:26:18,480 --> 00:26:19,480
Yeah.

359
00:26:19,480 --> 00:26:23,000
I mean, there's two quotes that we're going to read out verbatim.

360
00:26:23,000 --> 00:26:27,520
One of them is from the launch blog post and one is from the sort of more technical deep

361
00:26:27,520 --> 00:26:30,240
dive of how this all works.

362
00:26:30,240 --> 00:26:34,520
They mean the same thing in slightly different ways, but I just think it's interesting to

363
00:26:34,520 --> 00:26:39,600
think about them, especially because we have to work under the assumption that whatever

364
00:26:39,600 --> 00:26:45,620
we see today, the labs, the people in the labs doing the work are six to 12 months ahead

365
00:26:45,620 --> 00:26:46,800
of us, right?

366
00:26:46,800 --> 00:26:49,000
So they are on the cutting edge.

367
00:26:49,000 --> 00:26:53,920
Martin and I were talking off air as it relates to Sora about how a couple of months ago,

368
00:26:53,920 --> 00:26:58,920
maybe back end of 2023, Sam Altman said he'd seen something that he felt was the next step

369
00:26:58,920 --> 00:27:01,800
change forward that kind of blew his mind.

370
00:27:01,800 --> 00:27:03,660
And I'd have to guess it's this.

371
00:27:03,660 --> 00:27:08,480
If it's not this and there's something even more powerful than this, then holy crumbs

372
00:27:08,480 --> 00:27:11,760
we're in for a pretty interesting three or four months.

373
00:27:11,760 --> 00:27:13,060
But here's the quotes, right?

374
00:27:13,060 --> 00:27:18,700
So the first one, which is at the bottom of the research preview is, we believe the capabilities

375
00:27:18,700 --> 00:27:23,800
Sora has today demonstrate that continued scaling of video models is a promising path

376
00:27:23,800 --> 00:27:29,520
towards the development of capable simulators of the physical and digital world and the

377
00:27:29,520 --> 00:27:32,400
objects, animals and people that live within them.

378
00:27:32,400 --> 00:27:33,400
Right?

379
00:27:33,400 --> 00:27:36,760
Quote one, hold your thoughts, Martin, because I know you've been thinking about this a lot,

380
00:27:36,760 --> 00:27:37,760
right?

381
00:27:37,760 --> 00:27:42,420
Quote number two, which is at the bottom of the launch post, Sora serves as a foundation

382
00:27:42,420 --> 00:27:46,960
for models that can understand and simulate the real world, a capability we believe will

383
00:27:46,960 --> 00:27:51,400
be an important milestone for achieving AGI.

384
00:27:51,400 --> 00:27:55,880
This model, as Martin just alluded to, is a GPT.

385
00:27:55,880 --> 00:28:01,360
It predicts the next thing based on the things that came before it.

386
00:28:01,360 --> 00:28:07,860
How on earth do we see this emerging capability to sort of understand the physics of the world

387
00:28:07,860 --> 00:28:12,040
and output them based on a text based prompt?

388
00:28:12,040 --> 00:28:13,920
I mean, Martin, thoughts?

389
00:28:13,920 --> 00:28:17,120
The researchers have really been confident about this for some time.

390
00:28:17,120 --> 00:28:18,120
I know there's some dispute.

391
00:28:18,120 --> 00:28:24,720
If you were to listen to the likes of Yann LeCun from Meta, he would argue that there

392
00:28:24,720 --> 00:28:31,200
are limitations to this, but clearly OpenAI see this as the direction of travel and then

393
00:28:31,200 --> 00:28:35,420
it's a question of scale and compute.

394
00:28:35,420 --> 00:28:42,600
If you keep throwing more and more of that at it, these transformer models will get them

395
00:28:42,600 --> 00:28:46,120
a long way towards AGI.

396
00:28:46,120 --> 00:28:52,080
This is a really interesting step in helping these models to develop what some might call

397
00:28:52,080 --> 00:28:58,880
a world model, that idea of understanding the world around them.

398
00:28:58,880 --> 00:29:06,640
GPT-4 had some elements of, from a text based perspective, having something of a world model

399
00:29:06,640 --> 00:29:07,640
within it.

400
00:29:07,640 --> 00:29:11,120
I think in the research paper when they announced it, they gave the example of with

401
00:29:11,120 --> 00:29:19,920
GPT-3, if you have described a house and said you walk through the front door, on the left

402
00:29:19,920 --> 00:29:24,920
there's a kitchen, a living room, on the right, and you kind of give a visual, or sorry, a

403
00:29:24,920 --> 00:29:29,560
verbal description or written description of the layout of a house.

404
00:29:29,560 --> 00:29:36,520
And then you said things like, what was the example they gave?

405
00:29:36,520 --> 00:29:43,360
I think if you walk into the bedroom, you put something on the bed, you pick it up,

406
00:29:43,360 --> 00:29:48,800
you walk into the bathroom, where is the thing?

407
00:29:48,800 --> 00:29:51,440
And GPT-3 would be like, it's on the bed.

408
00:29:51,440 --> 00:29:56,320
Whereas GPT-4 had this understanding, it would know that you picked it up and you've now

409
00:29:56,320 --> 00:29:59,200
carried it into the bathroom.

410
00:29:59,200 --> 00:30:03,200
And it was also able to give you visual reasoning with maps.

411
00:30:03,200 --> 00:30:06,920
So there was evidence of this happening already.

412
00:30:06,920 --> 00:30:15,800
Now seeing this manifested in video is, yeah, very, very impressive.

413
00:30:15,800 --> 00:30:21,400
Yeah, like an emerging understanding of cause and effect in the real world, which is interesting.

414
00:30:21,400 --> 00:30:27,120
I mean, Microsoft released that 100 page paper, Sparks of AGI, about GPT-4 and its emerging

415
00:30:27,120 --> 00:30:28,440
capabilities.

416
00:30:28,440 --> 00:30:34,400
I mean, when Yann LeCun doesn't think this is the route, certainly someone who knows

417
00:30:34,400 --> 00:30:35,680
what he's talking about, right?

418
00:30:35,680 --> 00:30:41,600
So the fact that there is divergent opinion in the world's leading experts in terms of

419
00:30:41,600 --> 00:30:46,800
what these models are going to be capable of and what they're even capable of now, in

420
00:30:46,800 --> 00:30:49,520
terms of what they're really capable of, is interesting.

421
00:30:49,520 --> 00:30:53,120
But I was walking the dog today, Martin, and I was trying to think about what does all

422
00:30:53,120 --> 00:30:54,640
this mean?

423
00:30:54,640 --> 00:31:01,960
And I was just trying to reflect upon how a two-year-old learns about the world, which

424
00:31:01,960 --> 00:31:08,560
is through its eyes, its ears, its sense of touch.

425
00:31:08,560 --> 00:31:13,160
So a lot of the things that a child is using as raw inputs to learn about the world is

426
00:31:13,160 --> 00:31:16,000
what we've given these large language models.

427
00:31:16,000 --> 00:31:21,840
If the GPT neural net framework is able to learn in inverted commas in a similar way

428
00:31:21,840 --> 00:31:23,960
to the human brain, that wouldn't be a surprise.

429
00:31:23,960 --> 00:31:27,680
We kind of came up with the idea based on the fact that's how the brain works.

430
00:31:27,680 --> 00:31:33,000
We've just simulated it in a different way using silicon rather than biological components.

431
00:31:33,000 --> 00:31:38,160
So it doesn't surprise me perhaps then that you feed it enough information.

432
00:31:38,160 --> 00:31:42,760
I should say, I think the silicon-based version we've created is like kind of crappy compared

433
00:31:42,760 --> 00:31:49,000
to biology's ability to draw general principles from very little data because our system we've

434
00:31:49,000 --> 00:31:50,080
built is crappy.

435
00:31:50,080 --> 00:31:54,440
We have to give it petabytes of data for it to be able to pull out those patterns, whereas

436
00:31:54,440 --> 00:31:57,440
somehow the human brain is able to learn them much faster and much better.

437
00:31:57,440 --> 00:32:00,680
And I'm sure as we improve our models and architecture, maybe that's where we'll see

438
00:32:00,680 --> 00:32:01,920
the biggest improvements.

439
00:32:01,920 --> 00:32:07,880
But feeding it that information and its ability to just learn like a child almost.

440
00:32:07,880 --> 00:32:14,200
And then we were talking about this off air and I was mentioning that wouldn't it be interesting

441
00:32:14,200 --> 00:32:19,520
if you took one of these models and then trained it based on two cameras worth of data, not

442
00:32:19,520 --> 00:32:20,520
one.

443
00:32:20,520 --> 00:32:21,520
Like you gave it human eyes.

444
00:32:21,520 --> 00:32:26,520
So instead of it having to infer the world from one static 2D picture, it had two pictures

445
00:32:26,520 --> 00:32:30,560
of the same thing, stereoscopic vision, so it can understand depth a bit more like we

446
00:32:30,560 --> 00:32:32,760
do, and how cool that would be.

447
00:32:32,760 --> 00:32:38,200
And then we were like, oh, Meta have been recording it for at least four or five years,

448
00:32:38,200 --> 00:32:43,360
probably longer, not just from two cameras, but from like 12.

449
00:32:43,360 --> 00:32:49,800
So imagine the data repository that they've got for this type of model training of the

450
00:32:49,800 --> 00:32:50,800
real world.

451
00:32:50,800 --> 00:32:56,240
And now people are out walking about the streets in their Vision Pros collecting multi-camera

452
00:32:56,240 --> 00:32:58,120
data of the real world.

453
00:32:58,120 --> 00:33:04,480
So and the fact that we've been feeding all this data to YouTube for 10, 15, 20 years,

454
00:33:04,480 --> 00:33:08,100
like the amount of data out there to train these models.

455
00:33:08,100 --> 00:33:11,940
These tech companies have been so much more clever than I gave them credit for in terms

456
00:33:11,940 --> 00:33:16,340
of collecting all this data, because if they're able to now use it to give computers this

457
00:33:16,340 --> 00:33:21,560
real sense of how the world works, again, imagine the applications that's going to open

458
00:33:21,560 --> 00:33:22,560
up.

459
00:33:22,560 --> 00:33:25,240
Yeah, feed more into it, get more out, right?

460
00:33:25,240 --> 00:33:32,480
And it was interesting you mentioned about how do toddlers learn about the world and

461
00:33:32,480 --> 00:33:34,560
how do animals learn about the world?

462
00:33:34,560 --> 00:33:40,680
Because that's actually Yann LeCun's point about the limitations of transformers.

463
00:33:40,680 --> 00:33:47,240
Because there's only so far you can go; all of this extra sensory input is what helps us build

464
00:33:47,240 --> 00:33:51,640
our understanding of the world.

465
00:33:51,640 --> 00:34:02,400
How do you get a large language model to understand things like proprioception or temperature

466
00:34:02,400 --> 00:34:11,960
and gravity and all of these external forces that we encounter and what have you, it's

467
00:34:11,960 --> 00:34:13,880
very difficult to get there.

468
00:34:13,880 --> 00:34:16,720
And the human brain can do it incredibly efficiently.

469
00:34:16,720 --> 00:34:21,360
That's the advantage that biology has, right?

470
00:34:21,360 --> 00:34:26,800
Imagine we end up doing all this work and over like the course of like a thousand years

471
00:34:26,800 --> 00:34:30,100
and then we're like, at some point we switched to biological components because they're much

472
00:34:30,100 --> 00:34:34,040
better than silicon and we basically just create humans and stuff.

473
00:34:34,040 --> 00:34:36,160
That's what it's worth wanting.

474
00:34:36,160 --> 00:34:40,400
A couple of last few bits on this and then we'll get into like why this is important

475
00:34:40,400 --> 00:34:43,280
for marketing and business folk.

476
00:34:43,280 --> 00:34:48,880
Read the research paper because some of the things this tool can do are much more mind

477
00:34:48,880 --> 00:34:55,540
blowing for me than the launch blog post mentions and therefore a lot of the stuff you'll see

478
00:34:55,540 --> 00:35:01,400
being shared online for me is missing the point of some critical stuff, especially when

479
00:35:01,400 --> 00:35:03,480
it comes to things like marketing applications.

480
00:35:03,480 --> 00:35:05,680
So I'll give you an example.

481
00:35:05,680 --> 00:35:09,520
It's not just text to video, it's image to video and you can give it a video and it will

482
00:35:09,520 --> 00:35:15,640
imagine what the previous 10 seconds, or an extra 10 seconds added to that video, might look

483
00:35:15,640 --> 00:35:16,640
like.

484
00:35:16,640 --> 00:35:21,840
So there's lots of creative capabilities for marketers here that we'll only just start

485
00:35:21,840 --> 00:35:26,200
scratching the surface of when we get access to this.

486
00:35:26,200 --> 00:35:34,600
In some examples in the research paper, they upload like a flat 2D image of some monsters.

487
00:35:34,600 --> 00:35:40,480
It's just like a vector image of some characters, but then it animates them in a realistic way.

488
00:35:40,480 --> 00:35:42,920
Like it waves their hands and it moves their legs.

489
00:35:42,920 --> 00:35:47,760
So think of all those static graphics that you've got in all your presentations and on

490
00:35:47,760 --> 00:35:52,400
your website because hey, it would be cool if these animated, they'd be more interesting,

491
00:35:52,400 --> 00:35:56,720
but it's just not worth paying an animator to do the work because it just doesn't add

492
00:35:56,720 --> 00:35:58,360
enough value.

493
00:35:58,360 --> 00:36:03,120
Expect an explosion in animated everything just because you can.

494
00:36:03,120 --> 00:36:04,940
No ghost.

495
00:36:04,940 --> 00:36:07,220
Some of the examples are really cool.

496
00:36:07,220 --> 00:36:12,620
This extends to photorealistic video from an image.

497
00:36:12,620 --> 00:36:17,480
So there's an example of a dog wearing a beret, which I suspect was generated by, in fact,

498
00:36:17,480 --> 00:36:20,400
that was generated by DALL-E 3.

499
00:36:20,400 --> 00:36:26,000
And then the video is created of that dog and it just looks like a video of a dog, like

500
00:36:26,000 --> 00:36:27,000
it was a real dog.

501
00:36:27,000 --> 00:36:29,840
Like it's absolutely mind blowing.

502
00:36:29,840 --> 00:36:37,080
Then there's an example where they're able to create an infinitely looping video that

503
00:36:37,080 --> 00:36:41,360
you have to watch for about four or five cycles to get your head around where it loops, which

504
00:36:41,360 --> 00:36:42,360
is extremely clever.

505
00:36:42,360 --> 00:36:44,680
I wonder what we can do with that.

506
00:36:44,680 --> 00:36:46,780
There's video-to-video editing.

507
00:36:46,780 --> 00:36:51,680
So there's a video shot of a car driving along a road in a forest.

508
00:36:51,680 --> 00:36:56,760
And then the prompt is to change the setting to be a lush jungle.

509
00:36:56,760 --> 00:36:57,760
And it looks amazing.

510
00:36:57,760 --> 00:37:01,160
Like it looks very similar to the original input video.

511
00:37:01,160 --> 00:37:04,640
Imagine some of those videos we saw in the early days of Runway where people were filming

512
00:37:04,640 --> 00:37:10,200
themselves on their phone and then creating like stylized animations based on their own

513
00:37:10,200 --> 00:37:13,740
movement, but now photoreal.

514
00:37:13,740 --> 00:37:16,280
How does this change the gaming industry?

515
00:37:16,280 --> 00:37:19,800
How does this change the CGI video effects industry?

516
00:37:19,800 --> 00:37:27,560
How does this open up CGI level capabilities to brands in the B2B and B2C space that would

517
00:37:27,560 --> 00:37:32,760
normally have gone, of course I can't afford to spend two years and millions of

518
00:37:32,760 --> 00:37:37,920
dollars on Pixar-level engineers and animators to create this level of stuff.

519
00:37:37,920 --> 00:37:42,760
Oh, I did it when I went out in my garden with my phone and then put it through a text

520
00:37:42,760 --> 00:37:43,760
prompt.

521
00:37:43,760 --> 00:37:47,600
And then however long this model takes to render, we don't know that yet.

522
00:37:47,600 --> 00:37:49,440
Four minutes later, I get this thing.

523
00:37:49,440 --> 00:37:50,640
It's absolutely mind-blowing.

524
00:37:50,640 --> 00:37:52,320
The videos are a minute long.

525
00:37:52,320 --> 00:37:55,400
Most things lose coherence at 10, 15 seconds.

526
00:37:55,400 --> 00:37:57,080
So that's mind-blowing.

527
00:37:57,080 --> 00:38:01,720
And then the last thing that just like, again, I don't even know how we're going to use this.

528
00:38:01,720 --> 00:38:03,880
I just know that it's cool.

529
00:38:03,880 --> 00:38:06,360
They've got this interpolation effect.

530
00:38:06,360 --> 00:38:07,920
Have you seen this bit?

531
00:38:07,920 --> 00:38:10,280
No, I missed this bit.

532
00:38:10,280 --> 00:38:11,280
Go check it out.

533
00:38:11,280 --> 00:38:15,360
So basically they've got two videos, which as far as I understand it, were both synthesized

534
00:38:15,360 --> 00:38:20,120
by Sora and then they interpolate the videos.

535
00:38:20,120 --> 00:38:24,120
And I'm not even sure I can really explain what that means.

536
00:38:24,120 --> 00:38:26,720
In one video, they're shot.

537
00:38:26,720 --> 00:38:31,720
They've either shot or I suspect synthesized a drone flying around a ruin.

538
00:38:31,720 --> 00:38:36,440
So it's a drone flying around a ruin as if it's been shot by a drone above it.

539
00:38:36,440 --> 00:38:37,440
Right.

540
00:38:37,440 --> 00:38:42,120
And then the other video is, and again, who knows how they created this, must be synthetic

541
00:38:42,120 --> 00:38:47,440
because it's a butterfly swimming, flying around a coral reef.

542
00:38:47,440 --> 00:38:49,500
Then they interpolate the videos.

543
00:38:49,500 --> 00:38:55,720
So the video starts as a drone flying around the ruins and then the drone morphs into the

544
00:38:55,720 --> 00:39:03,480
butterfly and the ruins morph to make it look like they've been created out of coral.

545
00:39:03,480 --> 00:39:06,040
Like it's crazy.

546
00:39:06,040 --> 00:39:10,480
All 1080p, all photorealistic.

547
00:39:10,480 --> 00:39:15,960
I don't even think we've started to begin to think about how you could leverage those

548
00:39:15,960 --> 00:39:18,920
types of interesting effects to do stuff.

549
00:39:18,920 --> 00:39:21,840
And we're thinking big picture creative here.

550
00:39:21,840 --> 00:39:29,280
I think if you're a B2B strategist, brand manager, creative director, you are going

551
00:39:29,280 --> 00:39:35,880
to get to be able to do cool things in your business or for your clients that were previously

552
00:39:35,880 --> 00:39:41,280
so cost prohibitive as to be ridiculous to have even thought about them.

553
00:39:41,280 --> 00:39:46,080
Prepare to unleash your imaginations because that is what this technology I believe is

554
00:39:46,080 --> 00:39:47,080
going to allow.

555
00:39:47,080 --> 00:39:50,600
And I'm sure it will have loads of problems and artifacts and you'll never, it might take

556
00:39:50,600 --> 00:39:52,820
50 prompts to get the thing that you want.

557
00:39:52,820 --> 00:39:58,640
It's still going to be faster and less expensive than hiring a 20-person film crew and like

558
00:39:58,640 --> 00:40:02,360
42 animators of Pixar level quality.

559
00:40:02,360 --> 00:40:03,960
I just think it's cool.

560
00:40:03,960 --> 00:40:06,400
Shall we move on to our next story?

561
00:40:06,400 --> 00:40:07,400
Oh, sorry, Martin.

562
00:40:07,400 --> 00:40:15,960
Just in conclusion there, I mean, if you are a videographer or you've been making a bit

563
00:40:15,960 --> 00:40:25,960
of income from licensing stock video footage, maybe start diversifying those revenue streams.

564
00:40:25,960 --> 00:40:27,600
Indeed.

565
00:40:27,600 --> 00:40:32,680
I remember when Runway's capabilities were really starting to be a bit more controllable

566
00:40:32,680 --> 00:40:37,840
and people were shooting those 90 second videos where they obviously had to stitch all the

567
00:40:37,840 --> 00:40:39,480
videos they created together.

568
00:40:39,480 --> 00:40:43,040
And as people who've played with Runway and got poor results, we could only imagine how

569
00:40:43,040 --> 00:40:45,040
long that would have taken people, right?

570
00:40:45,040 --> 00:40:46,040
Right.

571
00:40:46,040 --> 00:40:47,040
A lot of it.

572
00:40:47,040 --> 00:40:48,040
Yeah, absolutely.

573
00:40:48,040 --> 00:40:56,040
I cannot wait to see the first short, three, four, five minutes, created with this technology.

574
00:40:56,040 --> 00:40:58,800
I bet you could do it in a day.

575
00:40:58,800 --> 00:40:59,800
Easy.

576
00:40:59,800 --> 00:41:03,920
I think the limitation then will be consistent characters and locales.

577
00:41:03,920 --> 00:41:05,440
How easy is that to do?

578
00:41:05,440 --> 00:41:09,480
And then of course, at some point we're going to want human characters that speak and interact.

579
00:41:09,480 --> 00:41:13,640
We're going to have the synthetic voices as we'll see a bit later.

580
00:41:13,640 --> 00:41:18,000
But how easy is it going to be to get a human to speak and have their mouth move in time

581
00:41:18,000 --> 00:41:19,560
with the speech and all that stuff?

582
00:41:19,560 --> 00:41:25,420
But it sounds stupid that we would even think about that as a possibility.

583
00:41:25,420 --> 00:41:31,680
But we may be months from that or maybe a year or two, but not decades if you look at

584
00:41:31,680 --> 00:41:32,680
this leap.

585
00:41:32,680 --> 00:41:33,800
I mean, it's mind blowing.

586
00:41:33,800 --> 00:41:34,800
It's mind blowing.

587
00:41:34,800 --> 00:41:35,800
Oh, right.

588
00:41:35,800 --> 00:41:38,800
Everybody take a deep breath.

589
00:41:38,800 --> 00:41:40,600
Shall we move on to the next story, Martin?

590
00:41:40,600 --> 00:41:42,240
Yeah, let's go ahead and do that.

591
00:41:42,240 --> 00:41:43,240
Right.

592
00:41:43,240 --> 00:41:46,040
Let's talk Stable Cascade.

593
00:41:46,040 --> 00:41:47,040
Any other week.

594
00:41:47,040 --> 00:41:48,960
I think this would be really, really cool and interesting.

595
00:41:48,960 --> 00:41:52,240
But compared to the things we've talked about already today, I guess it's almost kind of

596
00:41:52,240 --> 00:41:54,240
a little bit less interesting than it might have been.

597
00:41:54,240 --> 00:41:59,160
But it's still important, I think, for us marketers to know about because Stability

598
00:41:59,160 --> 00:42:06,400
AI have released a new text-to-image generation model, which has some slightly different things

599
00:42:06,400 --> 00:42:08,440
about it compared to the models we've seen so far.

600
00:42:08,440 --> 00:42:14,440
So unlike Stable Diffusion, the sort of normal model, the standard model that most tools

601
00:42:14,440 --> 00:42:19,600
are using, which is just one single large model, Stable Cascade is basically a three

602
00:42:19,600 --> 00:42:25,040
stage architecture of smaller models that work together, creatively called stage A, stage

603
00:42:25,040 --> 00:42:29,080
B and stage C. At least it does what it says on the tin.

604
00:42:29,080 --> 00:42:30,080
Right.

605
00:42:30,080 --> 00:42:35,200
Basically, the way that it works is it makes it extremely efficient to train and fine tune

606
00:42:35,200 --> 00:42:38,760
the model because instead of working with one massive model, you're working with three

607
00:42:38,760 --> 00:42:42,040
smaller models that interact.

608
00:42:42,040 --> 00:42:48,160
And some of the data around this is that there's a 16 times cost reduction over fine tuning an

609
00:42:48,160 --> 00:42:50,680
equivalently sized Stable Diffusion model.

610
00:42:50,680 --> 00:42:51,920
Why is this important?

611
00:42:51,920 --> 00:42:57,520
Because if you want to fine tune a model as a developer or for a custom use case and you

612
00:42:57,520 --> 00:43:01,560
want to feed it images and information to help shape what it outputs, obviously you

613
00:43:01,560 --> 00:43:04,120
don't want that to be too expensive.

614
00:43:04,120 --> 00:43:09,260
The image quality that it produces is extremely high.

615
00:43:09,260 --> 00:43:15,100
So it's not like they've lost any quality in terms of introducing this new way of doing

616
00:43:15,100 --> 00:43:19,120
things and it's got image masking.

617
00:43:19,120 --> 00:43:23,040
So for example, you can like blank out a bit of an image and then it will regenerate the

618
00:43:23,040 --> 00:43:24,040
image in that space.

619
00:43:24,040 --> 00:43:28,840
So in the example, there's a cat, they remove the cat's head and upper body and then they

620
00:43:28,840 --> 00:43:31,440
output it as a dog in the same locale.

621
00:43:31,440 --> 00:43:32,440
Works really well.

622
00:43:32,440 --> 00:43:34,840
They generate images based on line drawings.

623
00:43:34,840 --> 00:43:40,360
So we've obviously got in Clipdrop some Stability AI-driven sketching tools where you

624
00:43:40,360 --> 00:43:44,000
do a sketch and then the tool is able to produce an image based on the sketch.

625
00:43:44,000 --> 00:43:45,740
This is very, very similar to that.

626
00:43:45,740 --> 00:43:55,400
It can upscale images as well, but all of this is being done with this smaller, easier

627
00:43:55,400 --> 00:44:01,120
to manage model architecture, which is just finding new and more interesting ways to be

628
00:44:01,120 --> 00:44:09,520
able to make it possible to train models, build models and on much simpler hardware.

629
00:44:09,520 --> 00:44:15,280
I'm not sure if they mention anywhere that you can run this locally on like a computer

630
00:44:15,280 --> 00:44:18,240
rather than having to use it in the cloud.

631
00:44:18,240 --> 00:44:21,040
I suspect, I mean, you can run Stable Diffusion on your laptop.

632
00:44:21,040 --> 00:44:22,040
It's just slow.

633
00:44:22,040 --> 00:44:28,360
So I suspect this will allow local image generation much better than using current level of stable

634
00:44:28,360 --> 00:44:29,360
diffusion models.

635
00:44:29,360 --> 00:44:34,120
So, hey, if you've just got a laptop with a decent GPU in it, you can probably run this

636
00:44:34,120 --> 00:44:37,880
model pretty well, but we've got all this stuff happening in video.

637
00:44:37,880 --> 00:44:43,800
We've got Gemini Pro 1.5 and all this cool stuff, but we're racing ahead, not just in

638
00:44:43,800 --> 00:44:46,560
video, but image stuff is improving.

639
00:44:46,560 --> 00:44:48,120
It's just so much going on in this area.

640
00:44:48,120 --> 00:44:52,000
So again, if you're a marketer, really thinking about the creative tool set that you've got

641
00:44:52,000 --> 00:44:55,440
available to you to do your work.

642
00:44:55,440 --> 00:44:58,000
Let's talk text to speech.

643
00:44:58,000 --> 00:45:04,360
We've done video, we've done image, we've done retrieval of information and text documents.

644
00:45:04,360 --> 00:45:06,920
Let's talk a little bit about text to speech.

645
00:45:06,920 --> 00:45:09,520
Martin, tell us about the news from Amazon this week.

646
00:45:09,520 --> 00:45:13,240
Well, I nearly missed this, but you brought it to my attention and I'm glad you did because

647
00:45:13,240 --> 00:45:22,000
it's a fascinating update in an area that we normally just talk about with ElevenLabs and

648
00:45:22,000 --> 00:45:28,360
well, OpenAI announced their text to speech model at the developer day and that was very

649
00:45:28,360 --> 00:45:31,200
good, but it's not an area that we focus on greatly.

650
00:45:31,200 --> 00:45:39,600
So Amazon have announced a new large scale text to speech AI model called BASE TTS, so

651
00:45:39,600 --> 00:45:48,600
Base Text to Speech and it exhibits some interesting emergent abilities.

652
00:45:48,600 --> 00:45:54,520
And the researchers trained this model thinking that these might occur, that they might achieve

653
00:45:54,520 --> 00:45:56,040
some of these emergent abilities.

654
00:45:56,040 --> 00:46:04,360
So just to give a bit of background on it, it's the largest text to speech model to date, 980

655
00:46:04,360 --> 00:46:13,280
million parameters trained on a hundred thousand hours of public domain speech data, 90% of

656
00:46:13,280 --> 00:46:21,080
which is English, 10% Dutch, German and Spanish.

657
00:46:21,080 --> 00:46:29,160
They found, the researchers found that training a TTS model on just 10,000 to a hundred thousand

658
00:46:29,160 --> 00:46:36,240
hours of speech can result in some emergent abilities similar to the emergent abilities

659
00:46:36,240 --> 00:46:42,960
that we see when you scale up the data and the parameters in large language models.

660
00:46:42,960 --> 00:46:50,040
So the kinds of things that it's emerged, the capabilities that have emerged from it

661
00:46:50,040 --> 00:46:58,280
include a significantly improved ability to naturally speak certain elements of complex

662
00:46:58,280 --> 00:47:08,500
sentences such as compound nouns, emotions, foreign words, the kinds of things that normally

663
00:47:08,500 --> 00:47:13,280
trip up a text to speech engine.

664
00:47:13,280 --> 00:47:22,560
It also handles punctuation, so if you added a piece of script and it had like hashtag just

665
00:47:22,560 --> 00:47:30,360
saying, that phrase would trip up TTS models historically, but now they can do that.

666
00:47:30,360 --> 00:47:39,240
It has uncanny naturalness, can mimic speaker characteristics with just a few seconds of

667
00:47:39,240 --> 00:47:48,240
reference audio, all the kind of stuff that is kind of almost there with the existing

668
00:47:48,240 --> 00:47:49,240
models.

669
00:47:49,240 --> 00:47:53,360
And I'm really excited to get my hands on this because I've actually just upgraded my

670
00:47:53,360 --> 00:48:02,040
ElevenLabs subscription to get the fine-tuned voice clone, the professional voice cloning.

671
00:48:02,040 --> 00:48:11,320
And for that they require about three hours of audio is what they recommend.

672
00:48:11,320 --> 00:48:19,440
So I'm really interested to see how closely this new model can match somebody's voice.

673
00:48:19,440 --> 00:48:24,800
With the existing models, things like ElevenLabs and other text to speech providers,

674
00:48:24,800 --> 00:48:29,680
if you try and clone your voice, rather than do a pure clone from scratch of your voice,

675
00:48:29,680 --> 00:48:38,240
what actually happens is the model tries to fit your voice to one of the existing voices,

676
00:48:38,240 --> 00:48:44,600
which if you're British, one of the things that you'll often find is that it overfits

677
00:48:44,600 --> 00:48:54,360
to a British voice, which can make you sound more RP in your pronunciation, more old school

678
00:48:54,360 --> 00:48:55,360
BBC.

679
00:48:55,360 --> 00:48:57,400
Yeah, BBC in this, hush.

680
00:48:57,400 --> 00:48:58,400
Yeah.

681
00:48:58,400 --> 00:49:01,080
Received Pronunciation is a problem, right?

682
00:49:01,080 --> 00:49:08,080
Because that's not how actually most people in Britain would talk in 2024.

683
00:49:08,080 --> 00:49:17,880
So I'd be really curious to see, can it capture our actual dialect tone from just a couple

684
00:49:17,880 --> 00:49:23,480
of minutes of recording, or actually you still need quite a bit of fine tuning.

685
00:49:23,480 --> 00:49:28,240
This area for me is one of the most exciting and interesting areas of AI development for

686
00:49:28,240 --> 00:49:30,560
content marketing in particular.

687
00:49:30,560 --> 00:49:34,420
If you're a content marketer, I think there's so much that you can do with this because

688
00:49:34,420 --> 00:49:40,180
of the just vast amount of video content being consumed.

689
00:49:40,180 --> 00:49:46,200
If you can scale up your content production for social snippets and things like that,

690
00:49:46,200 --> 00:49:52,000
without you having to sit and record and re-record and all of the editing process, you can just

691
00:49:52,000 --> 00:49:57,180
get the voice that you like or get your own voice and type in some text and have it read

692
00:49:57,180 --> 00:49:59,440
aloud for you on a social post or something.

693
00:49:59,440 --> 00:50:02,140
I think there's massive potential there.

694
00:50:02,140 --> 00:50:05,600
But at the moment, I think the tools aren't quite up to the job.

695
00:50:05,600 --> 00:50:09,200
So this sounds very promising.

696
00:50:09,200 --> 00:50:11,240
Yeah, it's interesting.

697
00:50:11,240 --> 00:50:15,520
We've got an agenda item for one of our podcast episodes, but there's always, or recently

698
00:50:15,520 --> 00:50:20,240
has been too much exciting news for us to get to it where we want to deep dive into

699
00:50:20,240 --> 00:50:24,320
what does marketing and content consumption look like in a post.

700
00:50:24,320 --> 00:50:30,040
It was going to be in a post GPT-5 world, but I think you can add the base TTS model

701
00:50:30,040 --> 00:50:32,960
to that and Sora and Gemini 1.5 Pro.

702
00:50:32,960 --> 00:50:41,560
But yeah, the ability to create high quality content at scale, whether that's video, audio,

703
00:50:41,560 --> 00:50:45,360
text, is basically already there, but I don't think we've quite seen the ramifications of it

704
00:50:45,360 --> 00:50:47,480
yet as content marketers.

705
00:50:47,480 --> 00:50:49,960
How does that change how people consume information?

706
00:50:49,960 --> 00:50:54,000
How does that change how brands create information?

707
00:50:54,000 --> 00:50:59,560
Because I feel like I'm hooked to my devices already just trying to keep up with the information

708
00:50:59,560 --> 00:51:00,560
flow.

709
00:51:00,560 --> 00:51:07,260
Now, this is all just lowering barriers to create more information that I can also engage with

710
00:51:07,260 --> 00:51:09,060
in other domains.

711
00:51:09,060 --> 00:51:13,300
One of the benefits of text to speech is listening to podcasts I love because when I'm walking

712
00:51:13,300 --> 00:51:20,320
the dog, my ears aren't really doing anything, but the rest of me is, my eyes and my body.

713
00:51:20,320 --> 00:51:22,920
It's not the same as when trying to read a book or sit at a computer.

714
00:51:22,920 --> 00:51:27,840
So I do think we should get into that on one episode, and I probably don't have time today

715
00:51:27,840 --> 00:51:32,940
because we're near the end of the episode, but all the barriers dropping means more deluge

716
00:51:32,940 --> 00:51:33,940
of stuff.

717
00:51:33,940 --> 00:51:35,840
Only 24 hours in a day though.

718
00:51:35,840 --> 00:51:36,840
Well, this is it.

719
00:51:36,840 --> 00:51:41,880
So then it becomes about prioritization, the quality of the content that you're consuming.

720
00:51:41,880 --> 00:51:47,200
None of this stuff is doing any of the thought behind what would be useful for people to

721
00:51:47,200 --> 00:51:52,100
hear about, what unique expertise and viewpoints you have that you can share as a company,

722
00:51:52,100 --> 00:51:53,100
as a person.

723
00:51:53,100 --> 00:51:58,440
That's obviously still the critical bit that's far from automatable at this point.

724
00:51:58,440 --> 00:52:04,760
But I've listened, and maybe in the edit we'll add a few of the audio clips from this

725
00:52:04,760 --> 00:52:12,120
paper in because they're kind of really interesting in the way that...

726
00:52:12,120 --> 00:52:16,040
When I think about this, I'm always thinking because I've got a lot of books in my Kindle,

727
00:52:16,040 --> 00:52:20,720
I buy a lot of audio books, but I've got a lot of Kindle books where there isn't an audio

728
00:52:20,720 --> 00:52:22,640
book version and I'm like, that sucks.

729
00:52:22,640 --> 00:52:24,280
I want to listen to this when I'm walking the dog.

730
00:52:24,280 --> 00:52:28,360
I don't want to read it because I'm finding that my time to sit down and read long form

731
00:52:28,360 --> 00:52:30,440
content is ever more reduced.

732
00:52:30,440 --> 00:52:31,440
So I'm like, cool.

733
00:52:31,440 --> 00:52:36,240
When do I click a button in Kindle and listen to the audio book version of a book that doesn't

734
00:52:36,240 --> 00:52:38,160
have an audio book recording?

735
00:52:38,160 --> 00:52:40,820
One of the criticisms of that has been performance, right?

736
00:52:40,820 --> 00:52:47,120
But a couple of these audio examples, they must be trained on those types of performances

737
00:52:47,120 --> 00:52:52,680
because a lot of them are like stories talking about characters speaking to each other.

738
00:52:52,680 --> 00:52:58,240
And then the synthetic voice reading will be like, Paul entered the room and he looked

739
00:52:58,240 --> 00:53:02,240
to Martin and he whispered, Martin, we got to be careful.

740
00:53:02,240 --> 00:53:04,720
The dragon's over there.

741
00:53:04,720 --> 00:53:10,640
Martin looked frightened and like these examples that have been given, the model knows to...

742
00:53:10,640 --> 00:53:12,160
I can't even describe what that is.

743
00:53:12,160 --> 00:53:18,120
Like put on a different voice to imply that a character is speaking to another character.

744
00:53:18,120 --> 00:53:23,000
But also, and I chose the whispering example deliberately because one of the examples is

745
00:53:23,000 --> 00:53:27,520
whispering and that would be one of the main barriers, I think, because nobody wants to

746
00:53:27,520 --> 00:53:30,720
listen to an audio book that's monotone basically, right?

747
00:53:30,720 --> 00:53:34,420
And these text to speech engines, they're way better than monotone, but the ability

748
00:53:34,420 --> 00:53:37,000
to actually perform in them.

749
00:53:37,000 --> 00:53:38,000
Wow.

750
00:53:38,000 --> 00:53:44,080
I think that's where they... I mean, what BASE even stands for... The E in BASE stands

751
00:53:44,080 --> 00:53:45,520
for emergent abilities.

752
00:53:45,520 --> 00:53:51,360
So they're hanging their hat on this as we trained it on a load of stuff and it can do

753
00:53:51,360 --> 00:53:55,580
things that it shouldn't really be able to do and that we could perhaps have possibly

754
00:53:55,580 --> 00:53:59,000
predicted but it's much better than we thought it would be.

755
00:53:59,000 --> 00:54:01,160
So, what a week.

756
00:54:01,160 --> 00:54:02,160
What a week.

757
00:54:02,160 --> 00:54:07,440
And let's talk about the last story, Martin, because you spotted some interesting stuff online

758
00:54:07,440 --> 00:54:08,440
about this.

759
00:54:08,440 --> 00:54:11,440
And again, in the context of everything else that's happened this week, this is probably

760
00:54:11,440 --> 00:54:16,880
a bit like, well, it's just a rumor, but it probably would have had top billing last week,

761
00:54:16,880 --> 00:54:17,880
mind.

762
00:54:17,880 --> 00:54:21,160
So, let's imagine it's last week, dear listener.

763
00:54:21,160 --> 00:54:24,600
Martin, tell us about the rumor about OpenAI and Search.

764
00:54:24,600 --> 00:54:31,160
So there's a report in The Information, which is fast becoming my number one source of interesting

765
00:54:31,160 --> 00:54:37,240
content when it comes to the AI and technology field, that OpenAI is developing a web search

766
00:54:37,240 --> 00:54:46,160
product that will leverage Microsoft Bing's infrastructure in a move to basically position

767
00:54:46,160 --> 00:54:50,920
themselves up against Google in the search domain.

768
00:54:50,920 --> 00:54:53,280
The timing is interesting.

769
00:54:53,280 --> 00:55:03,120
So Satya Nadella had declared a year ago that Microsoft was making Google dance by infusing

770
00:55:03,120 --> 00:55:08,560
Bing with OpenAI's AI prowess.

771
00:55:08,560 --> 00:55:15,640
And despite that, Bing's market share of the search market has barely moved really.

772
00:55:15,640 --> 00:55:20,840
But we know that there is demand for AI-powered search.

773
00:55:20,840 --> 00:55:22,160
And how do we know that?

774
00:55:22,160 --> 00:55:28,760
Well, there is already an AI-driven search engine on the market, Perplexity.

775
00:55:28,760 --> 00:55:32,680
And Perplexity has been grabbing people's attention.

776
00:55:32,680 --> 00:55:36,360
Recently, it had a half a billion dollar valuation.

777
00:55:36,360 --> 00:55:40,760
Big investors going in, including Jeff Bezos.

778
00:55:40,760 --> 00:55:47,260
It's been, I think it was a hundred million dollars it raised in its recent round.

779
00:55:47,260 --> 00:55:53,020
It was reported that they have eight million dollars in annual recurring revenue.

780
00:55:53,020 --> 00:55:58,760
So that's people paying a subscription to access a search engine.

781
00:55:58,760 --> 00:56:07,240
So there is clearly an appetite and a shift in the way that people are searching.

782
00:56:07,240 --> 00:56:11,800
And when I see people talking about the way that they use Perplexity and when I reflect

783
00:56:11,800 --> 00:56:21,800
on my own use of Perplexity, I recognize that I go to it for answers immediately.

784
00:56:21,800 --> 00:56:24,440
I don't want to be presented with a link to find the answers.

785
00:56:24,440 --> 00:56:26,040
I want the answer given to me.

786
00:56:26,040 --> 00:56:30,720
And it looks like OpenAI have realized that that's what other people want and they are

787
00:56:30,720 --> 00:56:34,160
well positioned to give it to people.

788
00:56:34,160 --> 00:56:42,200
Yeah, it was obviously a rumor that was circulating, but then you spotted a post by AI guru, Ethan

789
00:56:42,200 --> 00:56:45,560
Mollick, who we've mentioned before on the podcast.

790
00:56:45,560 --> 00:56:50,520
And one of the things that we're seeing as a consistent pattern with Ethan is that

791
00:56:50,520 --> 00:56:56,840
OpenAI will drop some news and then he'll have a nice long post written about it for

792
00:56:56,840 --> 00:57:00,400
his Substack email newsletter because he's had it for a month already.

793
00:57:00,400 --> 00:57:04,520
And he's very, very good at keeping quiet about what he's got access to until everybody

794
00:57:04,520 --> 00:57:06,240
else in the world knows.

795
00:57:06,240 --> 00:57:14,580
But you spotted a post on LinkedIn related to this that we thought was interesting and

796
00:57:14,580 --> 00:57:18,240
maybe implied that Ethan has access already.

797
00:57:18,240 --> 00:57:22,960
I'm going to summarize the post and basically it's about sort of using iron bridges as a

798
00:57:22,960 --> 00:57:28,360
metaphor because he talks about how the first iron bridge was made using the same approaches

799
00:57:28,360 --> 00:57:33,320
that woodworkers would have used because no one really knew how to use and work with iron.

800
00:57:33,320 --> 00:57:35,800
So they applied what they knew to this new material.

801
00:57:35,800 --> 00:57:42,040
But of course, we know today that woodworking strategies and approaches are not appropriate

802
00:57:42,040 --> 00:57:43,040
for iron.

803
00:57:43,040 --> 00:57:46,080
You have to think about the problems that you're trying to solve with the material that

804
00:57:46,080 --> 00:57:48,680
you've got in hand in a different way and take different approaches.

805
00:57:48,680 --> 00:57:52,920
And then he goes on to say, for example, AI powered search is adding a new gloss to an

806
00:57:52,920 --> 00:57:53,920
old paradigm.

807
00:57:53,920 --> 00:57:59,200
Instead, we can focus on the reasons we search and try new ways to solve the problem.

808
00:57:59,200 --> 00:58:04,760
So why replicate search engines when even a GPT-4 class AI is capable of silently observing

809
00:58:04,760 --> 00:58:09,520
what you do and jumping in at the right moment to help provide informational suggestions

810
00:58:09,520 --> 00:58:12,680
if you seem to be confused or faltering?

811
00:58:12,680 --> 00:58:15,640
Well, dear listener.

812
00:58:15,640 --> 00:58:21,760
If past behaviour is a reasonable predictor of future behaviour, what is it that Ethan's

813
00:58:21,760 --> 00:58:25,080
playing with that we don't have access to yet?

814
00:58:25,080 --> 00:58:29,200
It does open some interesting possibilities.

815
00:58:29,200 --> 00:58:33,440
You've got to think that that is, you know, that's like kind of subtweet-esque.

816
00:58:33,440 --> 00:58:34,440
He's onto something.

817
00:58:34,440 --> 00:58:35,440
He knows something.

818
00:58:35,440 --> 00:58:36,720
That would be the guess.

819
00:58:36,720 --> 00:58:44,080
And I look at Gemini Chat's Google button that allows you to basically fact-check the

820
00:58:44,080 --> 00:58:50,640
output that you've just generated to see if it's, you know, either wrong or can be substantiated

821
00:58:50,640 --> 00:58:53,240
by information available on the internet.

822
00:58:53,240 --> 00:58:58,720
And one has to assume that that sort of button at the very least is coming to ChatGPT.

823
00:58:58,720 --> 00:59:05,640
Right, you get an output and you'll be able to check its validity.

824
00:59:05,640 --> 00:59:10,960
And then maybe if you've got these tools baked into a tool like Word, you're writing about

825
00:59:10,960 --> 00:59:16,400
something, you guess maybe, or you get it to generate a bit of an output for a paragraph

826
00:59:16,400 --> 00:59:17,640
that you're working on.

827
00:59:17,640 --> 00:59:21,640
And then it's silently searching other information sources in the background, making suggestions

828
00:59:21,640 --> 00:59:25,160
to you like, oh, have you thought about whether this point might be important for your blog

829
00:59:25,160 --> 00:59:28,240
post as well, here's some information about it.

830
00:59:28,240 --> 00:59:32,720
We can write you your next paragraph if you want us to type of stuff, I guess.

831
00:59:32,720 --> 00:59:35,400
I mean, I'm thinking about it very much like a marketer there, I guess.

832
00:59:35,400 --> 00:59:42,120
But yeah, all of which comes at the same time as we know, it's been reported this week that

833
00:59:42,120 --> 00:59:48,320
OpenAI is also investing quite heavily in the development of AI agents.

834
00:59:48,320 --> 00:59:53,600
So talk about convergence of different technologies.

835
00:59:53,600 --> 00:59:55,440
Where will this lead us, eh?

836
00:59:55,440 --> 00:59:59,920
So I think that's probably a great place to close out the episode because it's been a

837
00:59:59,920 --> 01:00:01,680
crazy week.

838
01:00:01,680 --> 01:00:05,880
I never thought we would see this acceleration with Gemini that we've seen.

839
01:00:05,880 --> 01:00:10,480
And that context window could really open up some interesting new applications.

840
01:00:10,480 --> 01:00:13,880
We've said it before on the podcast, if you're not recording and archiving a lot of your

841
01:00:13,880 --> 01:00:18,680
meetings, client calls, customer service calls, record them.

842
01:00:18,680 --> 01:00:22,440
Even if the technologies are not quite there today to do what you might want to do with

843
01:00:22,440 --> 01:00:24,320
them, they will be tomorrow.

844
01:00:24,320 --> 01:00:26,120
So get the data now.

845
01:00:26,120 --> 01:00:28,520
So that Gemini is cool.

846
01:00:28,520 --> 01:00:30,680
Sora, I mean, who saw that coming?

847
01:00:30,680 --> 01:00:33,440
That is absolutely mind blowing.

848
01:00:33,440 --> 01:00:40,480
But your point about agents, supported search or whatever this thing is that Ethan's referring

849
01:00:40,480 --> 01:00:45,240
to, I don't even think, well, it's 16th of February, 2024.

850
01:00:45,240 --> 01:00:49,880
We are not half done in terms of what we're going to see by the end of the year.

851
01:00:49,880 --> 01:00:55,360
And I think this week has taught me that as much as I feel like we're very close to it

852
01:00:55,360 --> 01:00:59,560
and can predict some of the things that are going to happen, I would not have seen that

853
01:00:59,560 --> 01:01:01,320
Sora video thing coming.

854
01:01:01,320 --> 01:01:06,360
No, no, that I would have expected 2025.

855
01:01:06,360 --> 01:01:08,560
That's when we'll see that kind of quality.

856
01:01:08,560 --> 01:01:09,560
Yeah.

857
01:01:09,560 --> 01:01:15,960
So we can't promise you such an earth shaking episode next week, but we can promise you

858
01:01:15,960 --> 01:01:16,960
we'll be here.

859
01:01:16,960 --> 01:01:18,720
There'll be a little bit of Derby County.

860
01:01:18,720 --> 01:01:22,040
There'll be a little bit of banter and there'll be a little bit of AI news.

861
01:01:22,040 --> 01:01:26,720
So until then, if you find this valuable, subscribe, share it with people who you think

862
01:01:26,720 --> 01:01:27,960
might also enjoy it.

863
01:01:27,960 --> 01:01:31,360
Martin, have a fab weekend and I look forward to speaking to you soon.

864
01:01:31,360 --> 01:01:32,360
See you.

865
01:01:32,360 --> 01:01:36,000
Thank you for listening to Artificially Intelligent Marketing.

866
01:01:36,000 --> 01:01:42,040
To stay on top of the latest trends, tips and tools in the world of marketing AI, be

867
01:01:42,040 --> 01:01:43,800
sure to subscribe.

868
01:01:43,800 --> 01:01:58,400
We look forward to seeing you again next week.