In this episode, I'm going to be exploring the exciting new image-generation capabilities of GPT-4o in ChatGPT and how teachers can use them to create engaging, visually rich resources across the curriculum. The AIcademia podcast is a weekly podcast helping educators like you leverage AI in your everyday practice. I'm your host, Andy Fisher, and thanks for joining me.

AI art, like all generative AI, has come a long way in the last 18 months. It wasn’t so long ago that it was almost impossible to create a scene with a realistic person in it - hands were a particular challenge for the models and people invariably ended up with too few or too many fingers. This gradually improved last year but it was still tough to get consistent characters or reliable text to accompany the images and so it had limited application for classroom resources.

All that changed, however, when a new native image generation model was released by OpenAI in late March. I can honestly say this has been one of the most exciting and magical AI experiences I have enjoyed since I started exploring this field. But first, here’s a short potted history of AI art creation, just so we can appreciate where we have now arrived.

In the early days of generative AI, the first models were Generative Adversarial Networks, or GANs for short. Imagine a school for forgers where pairs are teamed up for training. The first in each pair is the generator, whose job is to paint a convincing portrait of an elephant by, let’s say, Degas. The other member of the pair, the discriminator, must judge whether the outcome is convincing or not. The discriminator gives feedback to the generator each time so it can gradually improve through trial and error - they are engaged in a continuous and creative battle of wits. The generator is trying to trick the discriminator into thinking the output is a real Degas, and the discriminator in turn is trying to tell the forgeries from the real artist’s work. Through this adversarial process, both sharpen their skills, and over time the team becomes capable of producing highly convincing images.

For a while GANs dominated, but then along came diffusion models, which employ a different approach. Diffusion models are trained on vast collections of images to which increasing amounts of random noise are added until the original image is unrecognisable. The model then learns to work its way back to a denoised version of the original. With enough practice and a text prompt, it can start from pure noise and create an entirely original depiction of the target. Dedicated AI art generators like Midjourney and DALL-E are examples of diffusion models.

In this case, imagine a sculptor starting with a solid block of marble. It’s initially shapeless and rough, similar to the random noise a diffusion model starts from. Gradually, the sculptor chips away small fragments, refining the details with each precise stroke. Over time, the block transforms into a clearly defined, detailed sculpture.
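For anyone who prefers to see the idea in code, here is a minimal Python sketch of the "adding noise" half of the process. It follows one common textbook formulation, in which the noisy image at a given step is a weighted blend of the original and Gaussian noise. The function names, the four-pixel "image" and the schedule values are illustrative assumptions, not the internals of Midjourney or any other product. The learning half, where a neural network is trained to undo the noise, is left out entirely.

```python
import math
import random


def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (illustrative values)."""
    if steps == 1:
        return [beta_start]
    gap = (beta_end - beta_start) / (steps - 1)
    return [beta_start + i * gap for i in range(steps)]


def signal_fraction(betas):
    """Running product of (1 - beta): how much of the original survives at each step."""
    fractions, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        fractions.append(running)
    return fractions


def add_noise(pixels, alpha_bar, rng):
    """Blend the clean pixels with Gaussian noise at the given signal level."""
    keep = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [keep * p + noise_scale * rng.gauss(0.0, 1.0) for p in pixels]


rng = random.Random(0)
image = [1.0, -1.0, 0.5, -0.5]  # a toy four-pixel "image"
alpha_bars = signal_fraction(linear_beta_schedule(1000))

for t in (0, 249, 999):
    noisy = add_noise(image, alpha_bars[t], rng)
    print(f"step {t + 1:4d}: signal kept = {alpha_bars[t]:.4f}, pixels = {noisy}")
```

By the final step almost none of the original signal is left, which is the "shapeless block" the model learns to carve an image back out of.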


Then we have specialist LoRAs. LoRA stands for Low-Rank Adaptation, a technique used to fine-tune a model so that it consistently produces a likeness of a specific face, object or style of art. If you’ve seen adverts for headshot generators or AI-generated influencers, then this is likely the kind of model under the hood.
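The "low-rank" part of the name refers to a neat trick: the big weight matrices of the base model are left untouched, and the fine-tuning only learns a small correction, written as the product of two thin matrices. The rough Python sketch below shows why that is so much cheaper. The layer size, the rank and the tiny example matrices are made-up numbers chosen for illustration, and the helper names are hypothetical rather than taken from any real library.

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare trainable values: a full weight matrix versus a low-rank update."""
    full = d_out * d_in
    low_rank = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, low_rank


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w, b, a, scale=1.0):
    """Return W + scale * (B @ A); the base weights W themselves are never changed."""
    update = matmul(b, a)
    return [
        [w_ij + scale * u_ij for w_ij, u_ij in zip(w_row, u_row)]
        for w_row, u_row in zip(w, update)
    ]


full, low_rank = lora_param_counts(d_out=1024, d_in=1024, rank=8)
print(f"full fine-tune: {full} values, LoRA update: {low_rank} values")

# A tiny worked example: a 3x3 base matrix nudged by a rank-1 update.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [2.0], [0.0]]   # 3 x 1
A = [[0.5, 0.0, -0.5]]      # 1 x 3
adapted = apply_lora(W, B, A)
print(adapted)
```

With these example sizes the update needs 16,384 values instead of 1,048,576, which goes some way to explaining why a likeness can be trained from a few dozen photographs on modest hardware.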

The process begins with a baseline diffusion model, which is then fed a series of images of the target. For example, all the thumbnails for my channel are created using a LoRA trained on between 30 and 40 photographs of me. These were snapshots taken in different conditions and from different angles, with different facial expressions and different outfits. The model then finds the commonalities to arrive at an ‘Andy model’ that will create a fair likeness of me each time. Then I just need to prompt for ‘a man dressed as an astronaut’ or ‘a man swimming underwater with a shark hunting him’ and, as if by magic, I get cinematic-style images that would have taken hours and hours of work in Photoshop. It’s a simple and cost-effective process and is the reason why I stepped away from headshot photography!

Not so long ago I photographed headshots for actors, those looking for a decent LinkedIn profile image and sometimes for businesses. It was a side hustle that I fit in around my teaching commitments and I was pretty good. My degree in Communication Studies involved modules in photography and over the years I had built up all of the equipment and skills needed to operate semi-professionally but a year ago, as I started spending more time using AI tools, I gave up my studio space, sold most of my equipment and shut down my website because I could see the writing on the wall.

Much as I enjoyed photographing clients, and while I’m sure there will still be a place for some professional studios in the future, I could see that LoRA models would allow a user to upload some snaps from their phone’s camera roll, and less than an hour later they would have an array of excellent-looking images. There was no travel involved, no hair styling or make-up, no need to worry about outfits and the cost was less than 10% of what most photographers have to charge to cover their costs and make a reasonable profit. It was a no-brainer. This is just one example of how there will be an inevitable shift in employment as AI continues to improve.

Businesses will often use LoRAs for product design or marketing purposes so that the unique features and colours of their product are consistently reproduced. Again, given the costs of product photography, these models can save thousands of pounds and can be used for R&D as well as augmenting current advertising campaigns. For teaching purposes, LoRAs probably have limited application and, as we will go on to see, the native image models can do a pretty good job of giving us results similar to a LoRA’s without needing to fine-tune the model.

Now let’s talk a bit about tool calling so you can better appreciate the magnitude of the change that’s just happened. When you ask Copilot or ChatGPT to create an image, it looks like all the work happens in the model, but the LLM actually outsources the task to a dedicated art generator, which produces the image and sends it back. The LLM then serves it up to us and takes all the credit.

The art generator behind ChatGPT and Microsoft Copilot is DALL-E, so DALL-E’s limitations become the limitations of the LLMs which use it. Have you ever had an experience like this? You ask for an image - let’s say an illustration that represents the water cycle. The image comes back and it’s OK, but the text is all garbled and one of the stages has been missed out.

You prompt again, explaining the changes you want and the LLM chirpily admits the error and promises to fix it. A new image is generated but it is literally a NEW image - it bears little resemblance to the first one and while the missing stage in the cycle is now there, new errors have emerged and the text is even worse. You try a third time asking it to leave all text off (because you’ve decided to add it using Canva or Photoshop later) and just focus on the stages of the water cycle and this time you get a third entirely new image and it’s so close but still not quite there. The LLM is apologetic and promises to try harder but by this stage, you’ve had enough and make do with a photocopy from a textbook or a hand-drawn alternative.

The reason this happens is that your AI model can’t see the output from DALL-E - it just assumes that the image has faithfully matched the prompt provided.
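To make that hand-off concrete, here is a toy Python simulation of the pattern just described. Both functions are hypothetical stand-ins invented for this sketch, not real ChatGPT, Copilot or DALL-E calls, and the odds of legible text are made up. What matters is the shape of the exchange: the chat model passes a prompt along, gets something back, and replies without ever checking it.

```python
import random


def image_tool(prompt, rng):
    """Stand-in for a separate image generator: every call starts from scratch."""
    return {
        "seed": rng.randrange(1_000_000),    # fresh starting noise each time
        "prompt": prompt,
        "text_legible": rng.random() < 0.3,  # made-up odds, purely illustrative
    }


def chat_model_turn(user_request, rng):
    """Stand-in for the LLM: it writes a prompt, calls the tool, and replies.

    It never inspects the returned image, so its reply is the same
    whether or not the image matches the request.
    """
    image = image_tool(f"Illustration: {user_request}", rng)
    reply = "Here is your image, exactly as requested!"
    return reply, image


rng = random.Random(42)
requests = [
    "the water cycle with labelled stages",
    "the same picture, but add the missing condensation stage",
    "the same picture again, with no text at all",
]

turns = [chat_model_turn(request, rng) for request in requests]
for reply, image in turns:
    print(reply, "| seed:", image["seed"], "| text legible:", image["text_legible"])
```

Notice that "the same picture" means nothing to the tool in this sketch: it only ever receives a new prompt, so each attempt begins from a new random seed rather than from the previous image.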

It’s like a director, a cinematographer and an editor collaborating in different countries on a new feature-length movie and their communication is limited to postcards. The director based in LA messages his vision to the cinematographer in Australia who promises to bring the film to life through the lens. He shoots his dailies and then he sends the footage to the editor in France with a new set of postcards to help him craft it into the final film but the editor doesn’t have the suite of FX tools needed and some of the detail is lost in translation as the cinematographer attempts to convey the nuance of the original prompts for the editor in a new language. The editor then sends it back to the director without sending duplicate postcards to the cinematographer. The director sends a new somewhat terse postcard to the cinematographer, pointing out the ways in which the footage returned has not matched the brief. The cinematographer apologises profusely, promises to put things right and then sets about scribbling on his stack of cards, his French dictionary open, and round we go again.

Given these limitations, that particular movie is unlikely to win any Oscars despite the best efforts of all parties!

For the past year, this has been the state of play with AI art. Proprietary diffusion models like Flux, Ideogram and Midjourney have been making steady improvements and there are whole courses on how to prompt each model to get the best from them but they remain both impressive and frustratingly limited at the same time.

And that brings us to the present with OpenAI’s native image generator built into the 4o model. It is available to both free and paid users, although at the moment, due to massive demand, free users are limited to just 3 generations a day.

As a native model, the need for tool calling has gone - it’s as if the editor has been fired and the cinematographer has up-skilled to both shoot the footage and edit it as well. With this refined feedback loop and one shared language, all kinds of new opportunities become possible. And in this case the postcards have been dumped and the two can communicate by email with attachments to help guide their collaboration.

So what are the strengths of the new native image generation model?
Why is this version making such big waves?

For one, the image quality is leagues ahead of earlier models. We’re talking photorealistic details, clean line work, and few of the visual glitches that plagued earlier attempts. If you need a diagram of the digestive system or a representation of the Great Chain of Being, you can pretty much get a workable image from a single well-crafted prompt.

Second, text integration. This is a big one. You can now embed clear, coherent text directly into images. Think labels on diagrams, headings on posters, even whole infographics—without needing to bounce into Canva or PowerPoint afterwards.

Third - and this is the bit that got my creative juices flowing - GPT-4o handles diverse visual styles brilliantly. You may have seen the flood of Studio Ghibli style images that have been doing the rounds on social media but that’s just the tip of the iceberg. You can prompt for an image in the style of a Renaissance oil painting, 8-bit graphics, a pencil sketch or a watercolour and the model can handle those requests first time.

But here is what I think is most significant with the new ChatGPT native image model: in the past, the way that we prompted for image generation was really inefficient. Each output would be entirely siloed from future generations. It would be like employing a busload of graphic designers where each has one chance to get your prompt right before being fired. The next in line would then have a go, but none of them could talk to each other or learn from their mistakes.

Now it’s far more like having the designer in the room with you. You describe what you want, they do a preliminary sketch and then you can go back and forth until you arrive at something that matches your initial vision - or indeed improves upon your first idea because the process is collaborative. Say I ask for an image of King Canute sitting on his throne as the tide washes in. The first image is good but he’s facing away from the viewer and I want his face showing, so I create a new prompt using natural language - something like ‘have the king face the camera with an angry expression’. The next iteration stays faithful to the original, preserving the shoreline, the king’s costume, the composition and so on, but now he is indeed facing forwards. As an afterthought I ask for a title in gothic script which reads ‘time and tide wait for no man’ at the bottom of the image and, sure enough, the third image provides this, usually error free, while again preserving the other details.

This conversational refinement makes this model far more intuitive for classroom use. You can say “Can you make this more colourful?” or “Add a speech bubble to the character on the left,” and it just… does it.
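The difference between the two workflows can be pictured with a small Python sketch. It is a toy model of the idea only, with a dictionary standing in for the image and hypothetical helper names, and it makes no claim about how GPT-4o works internally. It simply contrasts starting again from the prompt with carrying the current state forward and changing only what was asked.

```python
def regenerate(prompt):
    """Old workflow: every request builds a brand-new scene from the prompt alone."""
    return {"subject": prompt}


def refine(scene, **changes):
    """Conversational workflow: copy the current scene and change only what was asked."""
    updated = dict(scene)
    updated.update(changes)
    return updated


first_draft = {
    "subject": "a king on his throne",
    "setting": "shoreline with the tide coming in",
    "facing": "away from the viewer",
    "caption": None,
}

second_draft = refine(first_draft, facing="camera", expression="angry")
third_draft = refine(second_draft, caption="Time and tide wait for no man")

print(third_draft)
print(regenerate("a king facing the camera"))  # nothing carried over from earlier drafts
```

Each refinement keeps the setting and subject from the draft before it, whereas the regenerated version knows nothing beyond the words in its own prompt.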

And finally there’s the multimodal input to be considered. You can feed it both text and images, so it will match the composition, style or theme you’ve uploaded. Rather than relying on careful text prompting, you can share an example of the colour palette, the character or the setting you want to use and the model will draw on those resources in generating its output.

Put all of these features together - the quality of the images and the range of styles available, the integration of reliable text, natural language prompting, iterative collaborative workflow and multimodal inputs and you have a turbo-charged resource generation tool.

Here are just some of the ways you might put it to the test in your own classroom practice.

1 You could take your own portrait, convert it into a Pixar or anime-style character, add your name underneath and pin it on your door so that students will know that’s where to find you.

2 You can improve the visuals of any presentation or handout resource with bespoke and relevant images.

3 You can create labelled infographics and timelines to support topic revision.

4 You can create professional looking booklet covers, posters and flyers for topic guides, school events, plays, extra-curricular clubs and upcoming field trips.

5 You can create graphic novel pages with dialogue to convey a key concept.

6 You can create bespoke celebration postcards to send home when a pupil achieves a particular milestone or exceeds expectations.

7 Finally, if you are in a position to allow students access to the model, they could take these same use cases and produce compelling images themselves. They could design an illustration to go with their short story or poem, create a visual summary of the material covered in a unit or make their own illustrated guide to a topic of your choice.

We are really only limited by our imaginations. My advice would be to start small - challenge yourself to produce one resource this week which could add value to your learners and then build from there. Try to be clear and specific in your prompt, learn how you can iterate back and forth with the model and always keep an eye out for accuracy.

Look, there’s no question that tools like GPT-4o are going to become a bigger part of education in the months and years to come. But I think what’s most exciting about this particular update is that it gives us new ways to see—literally.
It bridges that gap between the abstract and the concrete, the imagined and the visualised. And in doing so, it gives students another way to express what they know and what they’re curious about.
Used thoughtfully, this could be a turning point for how we teach—and how our students learn.

So this brings us to the end of the first season of the AIcademia podcast - I’ll be taking a couple of weeks away from this project now as I prep materials for my GCSE students as they get ready for study leave. It will also give me a chance to reflect on what direction I’d like to take things in season 2. I’ve had some useful feedback suggesting that while the content is appreciated, the nature of the topic might be better served as a series of YouTube tutorials rather than an audio-only podcast and I’m certainly open to that idea.

Also, creating a weekly show has been a good challenge, but I think it might take some pressure off and allow me to offer more value if I shift to a biweekly output of some kind. So if you have any thoughts about any of these suggestions, do let me know and I’ll take your views into account as I plan out season 2.

Thanks for listening – I hope you’ve found some useful takeaways from the conversation. Please do spread the word if you think others would like the show, and do check out the AIcademia YouTube channel where you’ll find practical tutorials that complement the topics covered on this podcast. Have a great week and I look forward to catching up again soon.