The textbook Turing test
Could AI actually have an impact on productivity in educational publishing?
It’s not often textbook publishers make the front pages of the Financial Times, but this from 2 May 2023 caught my eye:
California-based Chegg, which provides online study guides, admitted on Monday evening that a “significant spike in student interest” in AI chatbot ChatGPT was starting to hurt its sales.
“We now believe it’s having an impact on our new customer growth rate,” said chief executive Dan Rosensweig on an earnings call with analysts.
On Tuesday, Chegg’s shares plunged by half and the warning rattled other companies, with shares in London-listed Pearson falling 15 per cent, language-learning platform Duolingo down by 10 per cent and US-listed education company Udemy dropping by more than 5 per cent.
[…]
Tom Singlehurst, an analyst at Citigroup, said ChatGPT could directly replicate the “study guide” service offered by Chegg, which gives students ready-made answers to questions on college courses.
However, he said that for Pearson, which creates course materials, ChatGPT was more likely to be a “second-order threat” that would “alter the way content was created by course administrators” in ways that were not yet clear.
Since writing about ChatGPT in my last post, I’ve been experimenting to try to get a sense of what generative AI is good at now, what it might be good at in the future, and what it might not be good at.
AI is developing at blistering speed, so any analysis of its current capabilities is always going to be a hostage to fortune as the next iteration is released.
If you want a pretty rational primer on large language models and how they work, I’d recommend this article from David Epstein.
And if you want to get stuck into the weeds, there’s a detailed walk-through of my experiment here. In the meantime, this post is a discussion of findings emerging from the experiment.
Productivity, but at what price?
What does a writer do?
They open a Word document. They put the kettle on and make a cup of tea. They clean out the fridge. They check the news. They check their emails. They check the weather forecast. They water the plants. They take out the bins. They check Instagram and/or Facebook and/or Twitter and/or LinkedIn and/or TikTok. Eventually, when all other sources of procrastination are exhausted, they write a few sentences and feel inadequate.
According to the Bene Gesserit litany, “the blinking cursor on the blank Word document is the mind-killer”.
One of the most immediate advantages of generative AI, then, may be its ability to produce a zero draft, which can then be edited and refined. The obvious application is in the writing of textbooks or teacher’s guides, but publishers produce plenty of other texts that can be easier to edit than write. Sales blurbs, web copy, difficult diplomatic emails to people who have let you down…
The current generation of generative AI tools suffers from its tendency to hallucinate. Or, more accurately, lie with conviction. For example:
AT: Can you give me a review for JR Hartley's Fly Fishing - what did the critics like and what did they feel were its shortcomings?
ChatGPT: I'm sorry, but as an AI language model, I do not have access to reviews for specific books unless they are available online. However, I can tell you a little bit about JR Hartley's Fly Fishing.
JR Hartley's Fly Fishing is a book about the art of fly fishing, written by an author named JR Hartley. The book became famous in the UK after it was featured in a popular TV advertisement in the 1980s, in which an elderly man goes on a search for a copy of the book. While the book itself may not have received critical reviews, it has become a cultural phenomenon and a nostalgic symbol of a bygone era.
This is true, up to a point. The advert is here, but the point of the question is that the book isn’t real. However, ChatGPT is keen to answer my question, so continues with a full-throated lie…
In general, fly fishing enthusiasts appreciate the book for its detailed instructions on the techniques and equipment used in the sport. However, some readers may find the book to be somewhat outdated or limited in its scope. Ultimately, the book's appeal may depend on the reader's personal interest in fly fishing and their nostalgia for a simpler time.
In general, AI is more likely to hallucinate when you ask it to solve an original problem. So, in this case, it makes stuff up because the data it’s been trained on doesn’t include a body of criticism of the relative strengths and weaknesses of this particular fictional book.
By this measure, it should be on safer ground when it comes to the content of K-12 textbooks. The topics covered by these books are, for the most part, well-established. School-level science is not cutting edge, so text generated about the seven characteristics of living things should have a relatively high level of accuracy.
But “should” is doing a lot of heavy lifting here. One of the issues with AI’s black-box methodology is that we have no idea what it’s been trained on. This could produce some unexpected blind spots: non-novel content can be novel to the AI, and we just might not know it.
AI-generated text needs to be checked. This may be a problem that’s solved in the future – Bing’s ChatGPT-powered bot is reportedly more accurate than ChatGPT on its own – but I wouldn’t put money on it.
Novelty or not very?
If we’re going to use generative AI in writing teaching and learning materials, we’re going to need to think a bit more systematically about the nature of the text we’re writing.
As a starting point, I’d suggest the following distinctions:
Non-novel text: Text for which there is likely to be a comparatively large body of existing material explaining the same concepts or content. For example, a narrative explanation of a concept that you might find in the body text of a secondary-level textbook.
Novel text: Text that applies this content in a context familiar or appealing to students or in the classroom. For example, activities, case studies, worked examples, etc.
AI can produce a decent first pass at non-novel text, although even if you give it guidance about the age and level of students, you’re likely to be left with questions about the depth required.
Perhaps more useful, however, is ChatGPT’s ability to generate multiple-choice questions to test students’ understanding of concepts in the text. It was comfortable generating a fairly prosaic set of questions but, with some prompting, ended up producing more interesting ones.
More caution is required for novel text. AI will often produce a convincing-looking effort, but one that starts falling apart on closer inspection. In my experiment, ChatGPT wrote an activity where students had to mime out reproduction, then suggested that students dress up like sperm.
Even when it’s producing more appropriate classroom content, ChatGPT tends to over-value groupwork and activities that make relatively poor use of classroom time.
Contrary to the suggestions of AI maximalists, there’s very much a need for human judgement around academic level and appropriacy of content. It seems unlikely that this is anywhere on the AI developers’ to-do list.
Want some spicy tea? Okay, remember Chegg and Pearson from the opening FT snippet? Back in 2021 Pearson took Chegg to court over copyright infringement because one of Chegg’s services was selling answers to the end-of-chapter questions in Pearson textbooks.
Here’s a hypothesis: if your content isn’t novel, it’s potentially easy to substitute. Hence Chegg takes the hit when an easy free substitute comes along. The question of what the second-order effects might be remains open.
Coding and prototyping
If AI habitually lies, creates inappropriate activities, and can be best trusted to produce low-originality text, you might be wondering whether it’s worth all the fuss. Which then leads me to ChatGPT’s under-hyped ability to code.
As a non-coder, I wanted to know if ChatGPT could program for me. In a matter of minutes, it had given me the HTML and JavaScript for an interactive quiz and some CSS to give it a fresh look. Not just that, it had told me what I needed to do to actually create the files and run them in a browser.
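To give a flavour of what came back, here’s a minimal sketch along the same lines. To be clear, this is my own illustrative reconstruction rather than ChatGPT’s actual output, and the questions are placeholders, but the shape is right: one self-contained HTML file with the quiz logic in JavaScript and a little CSS, which you can save as quiz.html and open in any browser.

```html
<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Quick quiz</title>
  <style>
    /* A little CSS for a "fresh look" */
    body { font-family: sans-serif; max-width: 600px; margin: 2em auto; }
    button { display: block; margin: 0.5em 0; padding: 0.5em 1em; }
    #feedback { font-weight: bold; }
  </style>
</head>
<body>
  <h1>Quick quiz</h1>
  <p id="question"></p>
  <div id="answers"></div>
  <p id="feedback"></p>
  <script>
    // Placeholder questions of my own; ChatGPT wrote its own set inline.
    const questions = [
      {
        text: "Which of these is one of the seven characteristics of living things?",
        options: ["Respiration", "Condensation", "Refraction"],
        correct: 0
      },
      {
        text: "What is excretion?",
        options: ["Taking in nutrients", "Removing metabolic waste", "Responding to stimuli"],
        correct: 1
      }
    ];
    let current = 0;

    // Render the current question with one button per option.
    function showQuestion() {
      const q = questions[current];
      document.getElementById("question").textContent = q.text;
      const answers = document.getElementById("answers");
      answers.innerHTML = "";
      q.options.forEach((option, i) => {
        const btn = document.createElement("button");
        btn.textContent = option;
        btn.onclick = () => check(i);
        answers.appendChild(btn);
      });
      document.getElementById("feedback").textContent = "";
    }

    // Give feedback and move on after a correct answer.
    function check(i) {
      const correct = i === questions[current].correct;
      document.getElementById("feedback").textContent =
        correct ? "Correct!" : "Not quite. Try again.";
      if (correct && current < questions.length - 1) {
        current++;
        setTimeout(showQuestion, 800);
      }
    }

    showQuestion();
  </script>
</body>
</html>
```

The point isn’t the code itself; it’s that a non-coder can get from a plain-English description to something clickable in minutes.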
ChatGPT has text limits. For most text you can simply ask it to continue when it hits a buffer, but for code the limit is hard. This limitation is worth keeping an eye on as new iterations and premium versions are rolled out, because this feels like an area where ChatGPT can offer real value.
Generating code from a non-technical description of what you want has the potential to greatly increase publishers’ ability to create prototypes to test and refine before getting stuck into the expensive work of building a product. Investing time and effort in AI-assisted wireframes could help make the cost of developing digital products more predictable and reduce the money wasted on features customers don’t really value.
Robot, cyborg, or meat balloon?
If you want to read the full walk-through of the experiment, grab yourself a cup of tea and click through here. It’s a long read, but I think it’s important that we understand how this technology works now and how it might work in the future.
So what’s the verdict? Are authors and editors throughout the land going the way of knocker-uppers, are they going to rise up and smash the machines like modern-day Luddites, or are they going to come together to form an alliance stronger than human or machine alone?
At the risk of missing an opportunity for a mic-drop punchy outro for this edition of Absolutely Textbook, the jury’s out… ¯\_(ツ)_/¯.
But people are experimenting, and it’s through these experiments that we’ll discover the most valuable use cases. Maybe it’s in content creation, maybe it’s in behind-the-scenes efficiency, or maybe it’s something altogether different?
If you’re experimenting with generative AI, why not reply to this or leave a comment to share what you’re doing? What’s working well? Where do you think it’s going to be most useful in the future?