In this episode, I’m going to be investigating the rise of so-called ‘Deep Research’ AI models and whether they’re fit for purpose as tools for educators. The AIcademia podcast is a weekly podcast helping educators like you leverage AI in your everyday practice. I’m your host, Andy Fisher. And thanks for joining me.

If you track the development of conversational AI for consumers, we have really seen four phases. The first saw a wave of generic chatbots which were closed text-to-text systems. Next came a slew of multi-modal models which dramatically improved the ways we could input and output content: we could talk to the AI model directly and it could talk back to us, it could recognise uploaded PDFs, images and Excel files, and it could output in a range of formats too, including code snippets and data visualisations.

Next came web-enabled models, so that we could engage with real-time information rather than be restricted to the data the models were trained on. Then, most recently, the fourth wave gave us reasoning models, which use inference time to think through responses, sacrificing speed of output for a better quality of answer.

Now we are seeing a new series of web-enabled agentic reasoning models. These hybrid systems combine the best of these previous innovations: the model takes a query, uses reasoning frameworks to come up with a multi-step research approach, and then scrapes the web for relevant sources, which it curates to produce the best answer to that query.
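For listeners who like to see the moving parts, that plan, search, curate and write loop can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the function names, the canned search results, the example.org URLs and the relevance threshold are all hypothetical stand-ins, and no vendor has published its pipeline in this form.

```python
from dataclasses import dataclass


@dataclass
class Source:
    url: str
    text: str
    relevance: float  # hypothetical score between 0.0 and 1.0


def plan_steps(query: str) -> list[str]:
    """Stand-in for the reasoning stage: split a query into sub-questions."""
    return [
        f"{query}: background",
        f"{query}: key facts and statistics",
        f"{query}: resources for further study",
    ]


def search_web(step: str) -> list[Source]:
    """Stand-in for web retrieval: returns canned results, no real crawling."""
    slug = step.lower().replace(":", "").replace(" ", "-")
    return [
        Source(f"https://example.org/{slug}/a", f"Notes on {step}", 0.9),
        Source(f"https://example.org/{slug}/b", f"Forum chatter about {step}", 0.3),
    ]


def curate(sources: list[Source], threshold: float = 0.5) -> list[Source]:
    """Keep only sources at or above the (assumed) relevance threshold."""
    return [s for s in sources if s.relevance >= threshold]


def deep_research(query: str) -> tuple[str, list[str]]:
    """Plan, search and curate, then assemble a report with numbered citations."""
    report_lines: list[str] = []
    citations: list[str] = []
    for step in plan_steps(query):
        for source in curate(search_web(step)):
            citations.append(source.url)
            report_lines.append(f"{source.text} [{len(citations)}]")
    return "\n".join(report_lines), citations


report, citations = deep_research("Industrial Revolution in London")
print(report)
```

In a real tool each stand-in would be a call to a reasoning model or a live search, but the shape of the loop is the point here: several planned steps, many sources considered, and only the curated ones cited.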

If this were outsourced to a human co-worker instead, it might look a little like this. Imagine we had a dedicated research assistant assigned to us, to whom we could offload some of the cognitive tasks on our to-do list. I might be planning a new scheme of work on the Industrial Revolution and the way in which it reshaped the landscape of London. I task them with doing some foundational research for me and send them off for a week to compile and organise that material into a meaningful workbook, with citations, suitable for my Year 11 class. They come back after spending 20-30 hours scouring 100 or more reputable online sources and deliver a 10,000 word base document which I can then tweak and refine to my specific needs. This co-worker is not serving as a replacement for me so much as a way for me to amplify my ability to explore, analyse, and iterate on available resources. This is basically what the new wave of ‘deep research’ tools now enables educators to do as part of our daily practice. They can scan and synthesise vast amounts of online information, completing multi-step research processes, but unlike our hard-working research assistant, they can do so in a matter of minutes.

Deep research AI tools are being touted as the next step in AI-assisted knowledge synthesis, and at least five major players are leading the charge in this space: Google Gemini, OpenAI, Perplexity, Microsoft Copilot and Elon Musk’s newly released Grok 3 all offer variations of these agentic models. Confusingly, they are named using very similar conventions: Deep Research, Deeper Thought, Deep Think and so on.

Advocates of these tools make bold claims. Sam Altman, CEO of OpenAI, has suggested that for 50 cents of compute, these models provide the equivalent of $5000 worth of work. It’s been described as “having a PhD in your pocket”—but is that just marketing hype, or are these tools genuinely revolutionising academic research?

Given that each of these offerings provides a similar service, let’s begin by considering their advantages and disadvantages:

In comparison to human effort, they are very fast. What might take a competent researcher multiple hours can be executed in minutes. Most estimates suggest a 100x time compression, which has obvious benefits for time-poor educators looking for ways to claw back some kind of work/life balance.

Next, unlike many of the previous AI systems, they provide citation-based responses, allowing users to go back and scrutinise the sources that the output is based on. This reduces the likelihood of hallucinations and adds some academic rigour to the process.

And finally, the models benefit from enhanced reasoning capabilities. Benchmarks suggest there has been a significant leap in AI’s ability to navigate nuanced and complex topics. Perplexity’s Deep Research tool, for example, scores 93.9% on the SimpleQA benchmark (a test designed to measure factual accuracy on short, fact-seeking questions) and it outperforms most other AI models on ‘Humanity’s Last Exam’, where it achieved 21.1%, compared to 9.4% for DeepSeek R1 and just 3.3% for GPT-4o. OpenAI’s equivalent Deep Research tool scores even higher at 26%. These deep research tools are simply better at the kind of complex research tasks which educators rely on when preparing our materials or while keeping up with our fields of specialisation.

Ethan Mollick, a well-respected academic in the AI space, made the following comment after his first forays into using the OpenAI Deep Research tool. He said:

“It weaves together difficult and contradictory concepts, finding novel connections I wouldn’t expect. It cites only quality sources and is full of accurate quotations. I would have been satisfied to see something like this from a beginning PhD student.”

Not everyone, however, is convinced. Leon Furze, in his paper Hands-on with Deep Research, describes these tools as little more than “generators of lengthy, seemingly accurate reports that no one will actually read.” He argues that they create the appearance of research, rather than facilitating meaningful analysis.

Likewise, Mark Cummins has expressed frustration at the error rate he found in the generated reports, and said that when using Deep Research he feels like he is ‘wading through epistemic pollution’. He goes on to acknowledge that such systems will improve but maintains that it’s jarring how many users seem to find the outputs compelling or game changing.

And another early adopter, Yury Molodtsov, states on X that Deep Research mimics the ‘performance of an intern analyst who produces very official-looking memos with zero insights’.

So there are clearly quite polarised views out there about the efficacy and value of these deep research tools. Having investigated the topic further, I think there are four issues that are diluting the confidence that some have in the current models:

First, the deep research tools don’t currently have access to paywalled journals or papers, so much of the best academic research is missing from a basic web crawl. For many teachers at primary or secondary school level, this might be less of an issue, but for those conducting postgraduate level research it could be a deal breaker.

Next, like all current LLMs, which are still based on predictive text completion, there is always the possibility that hallucinations will occur, sometimes blending credible data with fabricated details. Teachers must remain alert to this possibility rather than place undue confidence in the output.

Third, some users have complained about a lack of analytical depth in some of the outputs. The models can summarise vast quantities of curated information, but they don’t yet generate novel insights, something crucial for high-level academic research. In this sense they should be considered as a starting point for academic study rather than as a replacement for human intellect.

And finally, users point to a lack of source discrimination: these tools currently treat all sources equally. A Reddit post, a YouTube video and a highly respected peer-reviewed journal article are given the same weight in the generated report, even though one might carry far more credibility than the others.

Personally, I am not troubled by these criticisms. I am not looking at these Deep Research tools as a proxy for a postgraduate level analyst. Instead, I’m curious whether they can save the typical educator like me time and effort in carrying out everyday research tasks, and I think there is absolutely an argument that this is the case.

I’ve spent the last week playing with the Perplexity Deep Research model and scouring online articles to see what prompts and use cases are being applied and the results are fascinating. 

There are a lot of users who are leveraging deep research to produce financial reports and market predictions. Then there are more pragmatic use cases. In one demo, a user asked the model to find the best snowboard for their specific needs, and another user was amazed at the medical advice it generated to support his daughter’s complex care plan. Apparently he shared the output with his professional medical team, who were shocked that the suggested approach was AI generated given its nuanced understanding of cutting-edge pharmaceutical interventions.

As always, users are finding ways of leveraging these tools for their specific needs and are finding value. In my case, I have run three different use cases.

First, I decided to run with my example from earlier and asked for a detailed report on the ways in which the Industrial Revolution reshaped the landscape of London. I asked Perplexity to make the report suitable for Year 9 pupils in UK schools and to include relevant facts, statistics and resources for additional study. This is a topic I know absolutely nothing about and so I was very curious to see the output. The model thought for just shy of 4 minutes, drew on an impressive 62 sources and then generated the study guide. To my untrained eye it looks fantastic and covers population increase, the housing crisis, the building of the canal and railway networks, pollution, depictions of this period of history in paintings and more. It ends with some suggested discussion points and some recommended educational visits, an online interactive map and various other resources.

For the second use case, I asked for the framework for an ethical AI use policy which could be used in a UK secondary school like the one I teach at. I crafted the initial prompt using ChatGPT and then ported it across to Perplexity. The output this time was very impressive, with 28 sources used. The policy framework includes GDPR and data protection considerations, it draws on existing published policies from other institutions as examples of best practice, and it is absolutely fit for purpose as a starting point for a fully realised policy document.

Finally, I returned to a use case for my current Year 11 class and attached a PDF of the Edexcel board’s examiner comments for the IGCSE Literature paper. I wanted the model to scrutinise the document, complement it by scraping the web, and then produce a study guide for my pupils. I wanted it to be populated with actionable takeaways on how to prepare for the anthology poetry task and how to achieve high marks in that specific question under exam conditions. This is a perfect example of something I could do myself but it would take a long time.

This time the advice was cogent, with an accurate summary of the source material and some excellent student advice, but some of the examples drew on poems which are not in the current anthology, presumably a consequence of outdated sources online. With more careful prompting I could have provided another PDF with the current anthology, and this may have avoided the issue, although it was easily fixed by replacing the examples with ones from the poems being studied.

In all three examples, the output was definitely time-saving and, with some additional editing, will be deployable in my workplace.

Assuming you would like to dip your toe in the deep research waters, I’d like to finish by making some basic recommendations about which model to use. I’m assuming here that you are in a similar position to me as a teacher with limited financial resources to invest.

All of the research I’ve conducted to date suggests that, for now, OpenAI’s Deep Research tool is the best on the market. Its reports are longer and more detailed, and they draw on more and better-quality sources. The quality of the output is superior because it uses OpenAI’s cutting-edge o3 reasoning model as an engine. So this would seem to be the obvious recommendation, BUT it is currently only available to those with a Pro account, which will set you back an eye-watering $200 per month! If you do have that kind of money, then you can enjoy 100 queries a month for that investment.

In contrast, Perplexity provides free account users with up to 5 queries per day using their Deep Research tool. You simply select the ‘deep research’ option from the drop-down menu in the prompt field and then use it as you would any other LLM. The output is typically a little briefer than OpenAI’s, draws on fewer sources and doesn’t have the benefit of the most advanced reasoning model on the market, but did I mention it’s free? This makes it the obvious choice for educators, and if you can invest the $20 a month for the Pro account, you’ll be able to generate 500 queries a day, which I imagine would exceed the needs of those with a full or part-time teaching job anyway.
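To put those prices side by side, here is a rough back-of-the-envelope calculation in Python. It uses only the figures quoted above and assumes a 30-day month and that you use your full allowance, which most teachers will not, so treat the per-query numbers as a lower bound rather than a real-world cost.

```python
DAYS_PER_MONTH = 30  # assumption: a 30-day month, full allowance used


def cost_per_query(monthly_fee_usd: float, queries_per_month: int) -> float:
    """Return the cost in dollars of a single query at full usage."""
    return monthly_fee_usd / queries_per_month


# OpenAI Pro: $200 per month for 100 Deep Research queries.
openai_pro = cost_per_query(200, 100)

# Perplexity Pro: $20 per month for up to 500 queries per day.
perplexity_pro = cost_per_query(20, 500 * DAYS_PER_MONTH)

# Perplexity free tier: 5 queries per day at no cost.
perplexity_free_queries = 5 * DAYS_PER_MONTH

print(f"OpenAI Pro: ${openai_pro:.2f} per query")
print(f"Perplexity Pro: ${perplexity_pro:.4f} per query")
print(f"Perplexity free: {perplexity_free_queries} queries a month for $0")
```

On those assumptions the OpenAI option works out at $2 per report, while the free Perplexity tier already offers more reports per month than the $200 plan.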

I suspect that over the next couple of months, as Grok 3 becomes available, as ChatGPT 5 and Claude 3 are released and as open-source deep research alternatives come to market, we will have a wider set of options available to us. Until then, I’d encourage you to have a play with Perplexity’s offering and see how you can use it to save time and augment your own research needs.

Thanks for listening—I hope you’ve found some useful takeaways from the conversation. Please do spread the word if you think others would like the show, and check out the AIcademia YouTube channel, where you’ll find practical tutorials that complement the topics covered on this podcast. Have a great week, and I look forward to catching up again soon!