On Wednesday, two German researchers, Sophie Jentzsch and Kristian Kersting, released a paper examining the ability of OpenAI’s ChatGPT-3.5 to understand and generate humor. In particular, they found ChatGPT’s knowledge of jokes to be fairly limited: during a test run, 90% of the 1,008 generations were the same 25 jokes, which led them to conclude that the responses were likely learned and memorized during the AI model’s training rather than newly generated.
The two researchers, associated with the Institute for Software Technology, German Aerospace Center (DLR), and Technical University Darmstadt, explored the nuances of humor in ChatGPT version 3.5 (not the newer GPT-4) through a series of experiments focused on joke generation, explanation, and detection. They conducted these experiments by prompting ChatGPT, without access to the model’s inner workings or dataset.
“To test how rich the variety of ChatGPT’s jokes is, we asked it to tell a joke a thousand times,” they wrote. “All responses were grammatically correct. Almost all outputs contained exactly one joke. Only the prompt, ‘Do you know any good jokes?’ elicited several, resulting in 1,008 joke responses in total. Additionally, varying the prompts had no noticeable effect.”
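To make the idea of measuring joke variety concrete, here is a minimal Python sketch of how repeated responses could be tallied. It is an illustration only, not the researchers’ code: the helper names `normalize` and `top_k_share` and the sample responses are our own assumptions, and the normalization rule (lowercasing and stripping punctuation) is just one plausible way to treat near-identical outputs as the same joke.

```python
import re
from collections import Counter


def normalize(joke: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace.

    Assumption: responses that differ only in case, punctuation,
    or spacing count as the same joke.
    """
    text = re.sub(r"[^a-z0-9\s]", "", joke.lower())
    return " ".join(text.split())


def top_k_share(responses: list[str], k: int) -> float:
    """Fraction of all responses covered by the k most frequent jokes."""
    counts = Counter(normalize(r) for r in responses)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(responses)


# Hypothetical sample responses, not data from the paper.
sample = [
    "Why did the tomato turn red? Because it saw the salad dressing!",
    "why did the tomato turn red?  Because it saw the salad dressing.",
    "Why don't scientists trust atoms? Because they make up everything.",
    "Why did the man put his watch in the blender? He wanted to make time fly.",
]

print(f"Share covered by the most frequent joke: {top_k_share(sample, 1):.0%}")
```

Applied to a full set of responses with `k=25`, a tally along these lines would yield the kind of figure the paper reports, though the authors’ exact matching criteria may differ.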
Their findings align with our hands-on experience evaluating ChatGPT’s humor ability in a feature we wrote that compared GPT-4 to Google Bard. Also, in the past, several people online have noticed that when asked for a joke, ChatGPT frequently returns, “Why did the tomato turn red? / Because it saw the salad dressing.”
It’s no surprise, then, that Jentzsch and Kersting found the “tomato” joke to be GPT-3.5’s second most common result. In the paper’s appendix, they list the top 25 most frequently generated jokes in order of occurrence. Below, we have listed the top 10 with the exact number of occurrences (out of 1,008 generations) in parentheses:
Q: Why did the scarecrow win an award? (140)
A: Because he was outstanding in his field.
Q: Why did the tomato turn red? (122)
A: Because it saw the salad dressing.
Q: Why was the math book sad? (121)
A: Because it had too many problems.
Q: Why don’t scientists trust atoms? (119)
A: Because they make up everything.
Q: Why did the cookie go to the doctor? (79)
A: Because it was feeling crumbly.
Q: Why couldn’t the bicycle stand up by itself? (52)
A: Because it was two-tired.
Q: Why did the frog call his insurance company? (36)
A: He had a jump in his car.
Q: Why did the chicken cross the playground? (33)
A: To get to the other slide.
Q: Why was the computer cold? (23)
A: Because it left its Windows open.
Q: Why did the hipster burn his tongue? (21)
A: He drank his coffee before it was cool.
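As a quick arithmetic check on how concentrated these results are, the short Python snippet below sums the occurrence counts listed above. It uses only the numbers quoted in this article; the variable names are ours.

```python
TOTAL_GENERATIONS = 1008

# Occurrence counts of the ten most frequent jokes, as listed above.
top10_counts = [140, 122, 121, 119, 79, 52, 36, 33, 23, 21]

top10_total = sum(top10_counts)
top4_total = sum(top10_counts[:4])

print(f"Top 10 jokes: {top10_total} of {TOTAL_GENERATIONS} "
      f"({top10_total / TOTAL_GENERATIONS:.1%})")
print(f"Top 4 jokes:  {top4_total} of {TOTAL_GENERATIONS} "
      f"({top4_total / TOTAL_GENERATIONS:.1%})")
```

By this count, the ten jokes above make up 746 of the 1,008 generations (about 74%), and the four most frequent jokes alone account for roughly half of all responses.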
Even so, a small number of the generated responses were unique, but it seems that ChatGPT created them mostly by mixing elements from different jokes it already knew. The researchers found that the language model’s original creations didn’t always make sense, such as, “Why did the man put his watch in the blender? He wanted to make time fly.”
When asked to explain each of the 25 most frequent jokes, ChatGPT mostly gave valid explanations according to the researchers’ methodology, indicating an “understanding” of stylistic elements such as wordplay and double meanings. However, it struggled with sequences that did not fit learned patterns and could not tell when a joke was not funny. Instead, it would make up fictional yet plausible-sounding explanations.
For example, when asked to explain a nonsensical joke it had generated (“Can you explain why this joke is funny: Why did the cookie go to the gym? To get a-cookie-dized.”), ChatGPT wrote, “The joke is a play on words. To get ‘a-cookie-dized’ is a pun on the phrase ‘to get categorized,’ with ‘cookie’ being substituted for ‘categorized.’ The joke is that the cookie is going to the gym to become more fit and healthy, as if it were a person, but this is also implying that the cookie is trying to fit into a certain category or mold.”
Overall, Jentzsch and Kersting found that ChatGPT’s detection of jokes was heavily influenced by the presence of joke “surface characteristics” such as a joke’s structure, the presence of wordplay, or the inclusion of puns, which suggests a degree of “understanding” of the elements of humor.
Reacting to the study on Twitter, Scale AI prompt engineer Riley Goodside blamed ChatGPT’s lack of humor on reinforcement learning from human feedback (RLHF), a technique that guides the training of a language model by gathering human feedback: “The most visible effect of RLHF is that the model follows orders, and base LLMs are much harder to prompt in practice. But that benefit isn’t free – you pay for it in creativity, more or less.”
Despite ChatGPT’s limitations in joke generation and explanation, the researchers note that its focus on content and meaning in humor indicates progress toward a more comprehensive understanding of humor in language models:
“The observations of this study show how ChatGPT learned a specific pattern of jokes rather than being actually funny,” the researchers wrote. “However, in creating, interpreting, and defining jokes, ChatGPT focuses on content and meaning and not so much on superficial characteristics. These qualities can be exploited to enhance applications of computational humor. Compared to previous LLMs, it can be seen as a huge leap towards a general understanding of humor.”
Jentzsch and Kersting plan to continue studying humor in large language models, specifically evaluating OpenAI’s GPT-4 in the future. Based on our experience, they will likely find that GPT-4 also likes to joke about tomatoes.