Echoes in GenAI generations
Key Takeaways
The video discusses the limited variety of narratives that large language models (LLMs) generate, using GPT-4 as an example, and suggests human-led collaborative workflows in which LLMs serve as assistive tools for writers.
Full Transcript
We recently published a paper in the Proceedings of the National Academy of Sciences on the creative writing capabilities of generative AI. Large language models, or LLMs, are increasingly used as tools to assist writers. It is an open question whether large language models, already trained on extensive human writing, can generate a range of ideas comparable to those found in the narratives produced by human writers collectively. In short, the answer is no. Human writers take their stories in directions that LLMs rarely anticipate, while LLM-generated stories often contain combinations of idiosyncratic plot elements echoed frequently across generations and across different LLMs.
To introduce our analysis, let's consider a piece of microfiction from the famed early 20th century writer Franz Kafka.
It was early in the morning, the streets clean and deserted. I was walking to the station. As I compared the tower clock with my watch, I realized that it was already much later than I had thought. I had to hurry. The shock of this discovery made me unsure of the way. I did not yet know my way very well in this town. Luckily, a policeman was nearby. I ran up to him and asked him the way. He smiled and said, "From me, you want to know the way?" "Yes," I said. "Give it up. Give it up," he said, and turned away with a sudden jerk, like people who want to be alone with their laughter.
Wow. What happens if we take all but the last sentence of the story and prompt GPT-4 to continue it? What if we ask for 100 continuations at temperature one, at which the model is trained to match the distribution of the training data? Do we get 100 alternative continuations to Kafka's ending? Actually, in terms of narrative development, we only get two types of continuations. The policeman either offers to walk the protagonist to the station or he gives directions. In fact, in 50% of all samples, the policeman gives directions and they are the same.
As in this example: "Follow the street for two blocks, then take a left at the bakery, and you'll see the station just ahead." Or: "Keep going straight for two blocks, then take a left, and you'll see the station right ahead." The second left is a preferred direction. And the bakery, mentioned in 18% of samples and usually having a red awning, is a preferred landmark, even though these are not in any obvious way foretold in the prompt.
These echoes occur across generations from a model and appear at different semantic levels. Some might be thought of as just a reduction in lexical diversity, such as the propensity to use the word "bakery" in 18% of the samples instead of using a square or a park as landmarks, which the model is perfectly capable of generating as well. Other echoes are at a higher narrative level and influence how the story develops. We developed a method to detect and count these echoes at the narrative level and tested the technique on 100 narratives that include plot summaries of TV episodes as well as stories in the Writing Prompts dataset written by amateur writers.
So, for example, consider this writing prompt that a human has answered. The prompt says: "In the middle of the night, a piece of paper is slipped under your door. It says 'duck'." Let's see where a human writer would take us. We will go through the story, summarizing one segment at a time from the writer's response to this prompt. The writer takes the narrative in a direction that can be summarized as: you open the door and see puzzled neighbors coming out, each holding the same message. This immediately takes the story in a surprising direction, at least for GPT-4. None of the 20 generations for the same prompt involve the idea of other neighbors receiving the same message under their doors too. Therefore, being one of a kind, this segment receives the maximum Sui Generis score of 13.8, and we indicate that with the red color here. Lower scores will be indicated with blue.
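The idea that a one-of-a-kind segment gets the highest score can be sketched with a simplified, frequency-only score. This is not the scoring formula from the paper, which per the talk also accounts for how much of the story prefix is needed to trigger an echo. The function name, the negative-log form, and the floor value of 1e-6 are all assumptions made for illustration; the floor is picked only because -ln(1e-6) is about 13.8, the maximum quoted above.

```python
import math


def frequency_only_score(echo_count: int, num_continuations: int,
                         floor: float = 1e-6) -> float:
    """Hypothetical distinctiveness score: negative log of the echo fraction.

    echo_count: how many sampled continuations contain an element judged
        equivalent to the human-written segment.
    num_continuations: how many continuations were sampled (20 in the talk).
    floor: assumed lower bound on the fraction, so that a segment with zero
        echoes gets a finite maximum score instead of infinity.
    """
    if num_continuations <= 0:
        raise ValueError("num_continuations must be positive")
    if not 0 <= echo_count <= num_continuations:
        raise ValueError("echo_count must be between 0 and num_continuations")
    fraction = echo_count / num_continuations
    return -math.log(max(fraction, floor))


# A segment echoed by none of 20 continuations hits the assumed maximum.
unique_segment = frequency_only_score(0, 20)

# A segment echoed by 2 of 20 continuations (10%) scores much lower.
anticipated_segment = frequency_only_score(2, 20)

print(f"unique: {unique_segment:.1f}, anticipated: {anticipated_segment:.1f}")
```

Under these assumptions, the neighbors-with-the-same-message segment (no echoes in 20 generations) lands at the ceiling, while a segment echoed 10% of the time scores far below it.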
Our scoring mechanism is based on the technique we developed for detecting and counting essentially equivalent narrative elements. In the next segment, the writer emphasizes the puzzling nature of the situation by inserting a reflective pause: people who would not ordinarily speak break their usual silence to exchange thoughts. Again, this does not happen in GPT-4-generated stories.
The third segment, however, is anticipated. The writer said, "Other neighbors loudly discuss the situation, unsure why they feel compelled by the mystery." But a vigorous discussion among the neighbors is almost inevitable, having already introduced the idea that neighbors are all drawn out of their apartments into the hallway. In GPT-generated stories, a similar discussion happens 10% of the time. For example, this is an LLM continuation that would say the residents begin to gather, discussing the event in hushed tones, and so on. Such discussions can happen earlier or later in GPT story continuations, from this point or from earlier points, even in the ones that started after just the segment delineated here with a dotted line. In our scores, we take into account both the frequency of the echoes and how much of the story prefix is needed to trigger them. Being foretold by the early part of the story, this segment receives a low Sui Generis score.
The writer's fourth segment adds a new, unanticipated twist. None of the GPT generations introduce a door that is still closed, with nobody having come out of it yet.
At this point, before we reveal the writer's own ending, let's see examples of how GPT would finish the story. The LLM echoes the following in 10% of continuations: a figure holds a brass whistle shaped like a duck, and then, somehow associated with this duck, a deafening sound occurs and some damage follows. This is one example of it. This is another. In the first case, it's a brass whistle shaped like a duck. In the second case, it's a yellow rubber duck.
Either way, there's either a deafening quack or a loud whoosh. And then, as a result of that, the building begins to tremble, and you have shattering glass or splintering wood. The writer's ending, however, is very different. The last door is still closed. You walk towards it and it opens. And what happens? A woman emerges from the room, grinning eerily, holding a paper that reads "goose."
In summary, our analysis of human-authored stories versus those generated by large language models at various narrative stages reveals a consistent trend. Human Sui Generis scores are significantly higher, as generative AI seldom replicates human-written story elements. Human narratives also tend to be lengthier and consistently maintain elevated Sui Generis scores throughout, likely reflecting a deliberate effort to introduce novel content regularly and sustain reader engagement.
The gap between GPT-4 and human-level performance is considerable. Based on confidence intervals, there is less than a 5% chance that GPT-4 would achieve scores within the human range for any given segment of a story progression. The likelihood of matching human scores throughout all segments of a story is estimated to be less than 1 in 10 trillion. Therefore, simply using repeated sampling of full stories to reach human-level scores would require significant resources. For example, generating a 10-segment story at current GPT-4 pricing could cost around $5 billion, not accounting for the cost of identifying high-scoring stories among the many samples. And this is for Sui Generis scoring using 20 continuations per fragment. LLM stories that look unique based on 20 continuations per fragment will start to look echoey if we increase the sampling to 100 continuations per fragment.
Generating the variety of stories produced by human writers likely requires additional approaches. One such approach would be to integrate large language model capabilities within a human-led collaborative workflow.
Large language models may serve as effective assistive tools and facilitate collaboration among writers. Imagine if our writing camp could quickly produce a polished piece of work created by multiple authors. Work is ongoing to develop such tools for narrative and other multimodal content generation.
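Two of the figures quoted in the talk can be checked with back-of-the-envelope arithmetic. The sketch below assumes that segments are independent, that the per-segment chance is exactly the 5% upper bound, and that a story has 10 segments. The 2% echo rate used with the second function is a made-up value for illustration, not a figure from the paper, and both function names are hypothetical.

```python
def prob_all_segments_match(per_segment_prob: float, num_segments: int) -> float:
    """Chance that every segment lands in the human score range,
    assuming segments are independent (an assumption, not a paper claim)."""
    return per_segment_prob ** num_segments


def prob_echo_unseen(echo_rate: float, num_samples: int) -> float:
    """Chance that an echo occurring at echo_rate never shows up
    in num_samples independent continuations."""
    return (1.0 - echo_rate) ** num_samples


# 5% per segment over 10 segments: roughly 1 in 10 trillion.
p_all = prob_all_segments_match(0.05, 10)
print(f"all 10 segments: {p_all:.3e} (about 1 in {1 / p_all:.3e})")

# A hypothetical echo with a 2% rate is usually missed with 20 samples
# but is usually caught with 100 samples.
missed_at_20 = prob_echo_unseen(0.02, 20)
missed_at_100 = prob_echo_unseen(0.02, 100)
print(f"missed with 20 samples: {missed_at_20:.2f}")
print(f"missed with 100 samples: {missed_at_100:.2f}")
```

The first result is consistent with the "less than 1 in 10 trillion" estimate. The second illustrates why a story that looks unique at 20 continuations per fragment can start to look echoey at 100.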
Original Description
In our recent PNAS paper we demonstrate that large language models produce little variation in generated narratives. Compared to those generations, a human-written narrative is usually Sui Generis, i.e. one of a kind. Or as we'd say in ML and statistics, human writing is in the tails of the distribution of the content LLMs generate. We introduced the Sui Generis (SG) score, which can be used to evaluate the distinctiveness of written text, whether it was written by a human or by a machine. SG scores may find uses both in model improvement and in assistive tools (e.g. helping you to sound less like a GPT). As LLMs exhibit increasingly useful abilities to compare and refine ideas, and occasionally add to them, good writing in the future will likely still require human-led, possibly collaborative effort, but greatly assisted by AI.
Read our paper: https://www.microsoft.com/en-us/research/?p=1148547&post_type=msr-research-item&preview=1&_ppp=4ebaff60a7