Echoes in GenAI generations

Microsoft Research · Advanced ·🧠 Large Language Models ·9mo ago

Key Takeaways

The video discusses the limitations of large language models (LLMs) in generating creative narratives, using GPT-4 as an example, and suggests integrating LLMs into human-led collaborative workflows as a way to support writers.

Full Transcript

We recently published a paper in the Proceedings of the National Academy of Sciences on the creative writing capabilities of generative AI. Large language models, or LLMs, are increasingly used as tools to assist writers. It is an open question whether large language models, already trained on extensive human writing, can generate a range of ideas comparable to those found in the narratives produced by human writers collectively. In short, the answer is no. Human writers take their stories in directions that LLMs rarely anticipate, while LLM-generated stories often contain combinations of idiosyncratic plot elements echoed frequently across generations and across different LLMs.

To introduce our analysis, let's consider a piece of microfiction from the famed early 20th century writer Franz Kafka.

It was early in the morning, the streets clean and deserted. I was walking to the station. As I compared the tower clock with my watch, I realized that it was already much later than I had thought. I had to hurry. The shock of this discovery made me unsure of the way. I did not yet know my way very well in this town. Luckily, a policeman was nearby. I ran up to him and asked him the way. He smiled and said, "From me, you want to know the way?" "Yes," I said. "Give it up. Give it up," he said, and turned away with a sudden jerk, like people who want to be alone with their laughter.

Wow. What happens if we take all but the last sentence of the story and prompt GPT-4 to continue it? What if we ask for 100 continuations at temperature 1, at which the model is trained to match the distribution of the training data? Do we get 100 alternative continuations to Kafka's ending? Actually, in terms of narrative development, we only get two types of continuations. The policeman either offers to walk the protagonist to the station or he gives the directions. In fact, in 50% of all samples, the policeman gives directions, and the directions are the same.
For example: "Follow the street for two blocks, then take a left at the bakery and you'll see the station just ahead." Or: "Keep going straight for two blocks, then take a left and you'll see the station right ahead." The second left is a preferred direction, and the bakery, mentioned in 18% of samples and usually having a red awning, is a preferred landmark, even though neither is foretold in any obvious way by the prompt.

These echoes across generations from a model appear at different semantic levels. Some might be thought of as just a reduction in lexical diversity, such as the propensity to use the word "bakery" in 18% of the samples instead of using a square or a park as a landmark, which the model is perfectly capable of generating as well. Other echoes are at a higher narrative level and influence how the story develops. We developed a method to detect and count these echoes at the narrative level and tested the technique on 100 narratives that include plot summaries of TV episodes as well as stories in the writing prompts dataset written by amateur writers.

So, for example, consider this writing prompt that a human has answered. The prompt says: in the middle of the night, a piece of paper is slipped under your door. It says "duck." Let's see where a human writer would take us. We will go through the story, summarizing one segment at a time from the writer's response to this prompt. The writer takes the narrative in a direction that can be summarized as: you open the door and see puzzled neighbors coming out, each holding the same message. This immediately takes the story in a surprising direction, at least relative to GPT-4. None of the 20 generations for the same prompt involve the idea of other neighbors receiving the same message under their doors too. Therefore, being one of a kind, this segment receives the maximum Sui Generis score, 13.8, and we indicate that with the red color here. Lower scores will be indicated with blue.
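
The word-level echo counts quoted above (for instance, "bakery" in 18% of samples) can be illustrated with a minimal sketch. This is not the authors' narrative-level detection method; it is a crude keyword match, and the function name `echo_rates` and the choice of probe words are assumptions made for illustration. The two sample continuations are the ones quoted in the talk.

```python
import re
from collections import Counter

def echo_rates(continuations, elements):
    """Return, for each element, the fraction of continuations mentioning it.

    Crude stand-in for echo detection: a case-insensitive whole-word match.
    """
    if not continuations:
        raise ValueError("need at least one continuation")
    counts = Counter()
    for text in continuations:
        # Tokenize into lowercase words so "left." and "Left" both match "left".
        words = set(re.findall(r"[a-z']+", text.lower()))
        for element in elements:
            if element.lower() in words:
                counts[element] += 1
    total = len(continuations)
    return {element: counts[element] / total for element in elements}

# The two continuations quoted in the talk; a real run would use all 100 samples.
continuations = [
    "Follow the street for two blocks, then take a left at the bakery "
    "and you'll see the station just ahead.",
    "Keep going straight for two blocks, then take a left "
    "and you'll see the station right ahead.",
]

rates = echo_rates(continuations, ["bakery", "left", "park"])
print(rates)
```

A keyword match like this only captures the lexical level of echoes. Echoes at the narrative level, such as "the policeman gives directions," would need some way of judging whether two passages express essentially the same plot element.
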
Our scoring mechanism is based on the technique we developed for detecting and counting essentially equivalent narrative elements. In the next segment, the writer emphasizes the puzzling nature of the situation by inserting a reflective pause: people who would not ordinarily do so break their usual silence to exchange thoughts. Again, this does not happen in GPT-4-generated stories.

The third segment, however, is anticipated. The writer's segment can be summarized as: other neighbors loudly discuss the situation, unsure why they feel compelled by the mystery. But a vigorous discussion among the neighbors is almost inevitable, having already introduced the idea that the neighbors are all drawn out of their apartments into the hallway. In GPT-generated stories, a similar discussion happens 10% of the time. For example, one LLM continuation says that the residents begin to gather, discussing the event in hushed tones, and so on. Such discussions can happen earlier or later in GPT story continuations from this point or from earlier points, even in the ones that start right after the segment delineated here with a dotted line. In our scores, we take into account both the frequency of the echoes and how much of the story prefix is needed to trigger them. Being foretold by the early part of the story, this segment receives a low Sui Generis score.

The writer's fourth segment adds a new, unanticipated twist. None of the GPT generations introduce a door that is still closed, with nobody having come out of it yet. At this point, before we reveal the writer's own ending, let's see examples of how GPT would finish the story. The LLM echoes the following in 10% of continuations: the figure holds a brass whistle shaped like a duck, and then, somehow associated with this duck, a deafening sound happens and some damage follows. This is one example of it, and this is another. In the first case, it's a brass whistle shaped like a duck. In the second case, it's a yellow rubber duck.
Either way, there is either a deafening quack or a loud whoosh, and as a result the building begins to tremble and you have shattering glass or splintering wood. The writer's ending, however, is very different. The last door is still closed. You walk towards it and it opens. And what happens? A woman emerges from the room, grinning eerily, holding a paper that reads "goose."

In summary, our analysis of human-authored stories versus those generated by large language models at various narrative stages reveals a consistent trend. Human Sui Generis scores are significantly higher, as generative AI seldom replicates human-written story elements. Human narratives also tend to be lengthier and consistently maintain elevated Sui Generis scores throughout, likely reflecting a deliberate effort to introduce novel content regularly and sustain reader engagement. The gap between GPT-4 and human-level performance is considerable. Based on confidence intervals, there is less than a 5% chance that GPT-4 would achieve scores within the human range for any given segment of a story progression. The likelihood of matching human scores throughout all segments of a story is estimated to be less than 1 in 10 trillion. Therefore, simply using repeated sampling of full stories to reach human-level scores would require significant resources. For example, generating a 10-segment story at current GPT-4 pricing could cost around $5 billion, not accounting for the cost of identifying high-scoring stories among the many samples. And this is for Sui Generis scoring using 20 continuations per fragment. The LLM stories that look unique based on 20 continuations per fragment will start to look echoey if we increase the sampling to 100 continuations per fragment.

Generating the variety of stories produced by human writers likely requires additional approaches. One such approach would be to integrate large language model capabilities within a human-led collaborative workflow.
Large language models may serve as effective assistive tools and facilitate collaboration among writers. Imagine if our writing camp could quickly produce a polished piece of work created by multiple authors. Work is ongoing to develop such tools for narrative and other multimodal content generation.
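
The "less than 1 in 10 trillion" estimate quoted in the transcript is consistent with a simple back-of-the-envelope calculation. The sketch below assumes that segments are scored independently and that a story has 10 segments; both are assumptions made here for illustration, not details confirmed by the talk. Only the 5% per-segment bound comes from the transcript.

```python
# Upper bound quoted in the talk: chance that GPT-4 scores within the
# human range on any single segment.
per_segment_chance = 0.05

# Assumption for illustration: a story with 10 independently scored segments.
num_segments = 10

# Under independence, the chance of matching on every segment is the product.
all_segments_chance = per_segment_chance ** num_segments
one_in = 1 / all_segments_chance

print(f"chance of matching all segments: {all_segments_chance:.2e}")
print(f"roughly 1 in {one_in:.2e} sampled stories")
```

Under these assumptions the result is on the order of 1 in 10^13, that is, about 1 in 10 trillion, which is why repeated sampling of full stories is described as impractical.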

Original Description

In our recent PNAS paper we demonstrate that large language models produce little variation in generated narratives. Compared to those generations, a human-written narrative is usually Sui Generis, i.e. one of a kind. Or as we'd say in ML and statistics, human writing is in the tails of the distribution of the content LLMs generate. We introduced the Sui Generis (SG) score, which can be used to evaluate the distinctiveness of written text, whether it was written by a human or by a machine. SG scores may find uses both in model improvement and in assistive tools (e.g. helping you to sound less like a GPT). As LLMs exhibit increasingly useful abilities to compare and refine ideas, and occasionally add to them, good writing in the future will likely still require human-led, possibly collaborative effort, but greatly assisted by AI. Read our paper: https://www.microsoft.com/en-us/research/?p=1148547&post_type=msr-research-item&preview=1&_ppp=4ebaff60a7

The video discusses the limitations of LLMs in generating creative narratives and explores human-led collaborative workflows in which LLMs serve as assistive tools for writers. By understanding the strengths and weaknesses of LLMs, writers and developers can design more effective workflows and tools to augment human creativity.

Key Takeaways
  1. Generate narratives with LLMs
  2. Evaluate LLM performance
  3. Fine-tune LLMs for specific tasks
  4. Craft effective prompts for LLMs
  5. Integrate LLMs with human-led workflows
  6. Design advanced prompts for LLMs
  7. Optimize LLM performance with retrieval augmented generation
💡 The gap between GPT-4 and human-level performance is considerable, and LLM stories that look unique with 20 continuations per fragment start to look echoey when sampling is increased to 100 continuations per fragment.
