Echoes in GenAI generations

Microsoft Research · Advanced · 🧠 Large Language Models · 9mo ago

Skills: LLM Foundations 90% · Prompt Craft 80% · Fine-tuning LLMs 80% · Multimodal LLMs 70% · Advanced Prompting 70%

Key Takeaways

The video discusses the limitations of large language models (LLMs) in generating creative narratives, using GPT-4 as an example, and explores how human-led collaborative workflows that incorporate LLMs could support writers working together.

Full Transcript

We recently published a paper in the Proceedings of the National Academy of Sciences on the creative writing capabilities of generative AI. Large language models, or LLMs, are increasingly used as tools to assist writers. It is an open question whether large language models, already trained on extensive human writing, can generate a range of ideas comparable to that found in the narratives produced by human writers collectively. In short, the answer is no. Human writers take their stories in directions that LLMs rarely anticipate, while LLM-generated stories often contain combinations of idiosyncratic plot elements echoed frequently across generations and across different LLMs.

To introduce our analysis, let's consider a piece of microfiction from the famed early 20th-century writer Franz Kafka:

It was early in the morning, the streets clean and deserted. I was walking to the station. As I compared the tower clock with my watch, I realized that it was already much later than I had thought. I had to hurry. The shock of this discovery made me unsure of the way. I did not yet know my way very well in this town. Luckily, a policeman was nearby. I ran up to him and asked him the way. He smiled and said, "From me, you want to know the way?" "Yes," I said. "Give it up. Give it up," he said, and turned away with a sudden jerk, like people who want to be alone with their laughter.

Wow. What happens if we take all but the last sentence of the story and prompt GPT-4 to continue it? What if we ask for 100 continuations at temperature 1, at which the model is trained to match the distribution of the training data? Do we get 100 alternative continuations to Kafka's ending? Actually, in terms of narrative development, we only get two types of continuations: the policeman either offers to walk the protagonist to the station or gives directions. In fact, in 50% of all samples, the policeman gives directions, and the directions are the same.
As in this example: "Follow the street for two blocks, then take a left at the bakery and you'll see the station just ahead." Or: "Keep going straight for two blocks, then take a left and you'll see the station right ahead." The second left is a preferred direction. And the bakery, mentioned in 18% of samples and usually having a red awning, is a preferred landmark, even though neither is in any obvious way foretold in the prompt.

Those echoes recur across generations from a model and appear at different semantic levels. Some might be thought of as just a reduction in lexical diversity, such as the propensity to use the word "bakery" in 18% of the samples instead of using a square or a park as a landmark, which the model is perfectly capable of generating as well. Other echoes are at a higher narrative level and influence how the story develops.

We developed a method to detect and count these echoes at the narrative level and tested the technique on 100 narratives that include plot summaries of TV episodes as well as stories in the Writing Prompts dataset written by amateur writers. For example, consider this writing prompt that a human has answered. The prompt says: "In the middle of the night, a piece of paper is slipped under your door. It says 'Duck.'" Let's see where a human writer would take us. We will go through one writer's response to this prompt, summarizing a segment at a time.

The writer takes the narrative in a direction that can be summarized as: you open the door and see puzzled neighbors coming out, each holding the same message. This immediately takes the story in a surprising direction, at least for GPT-4. None of the 20 generations for the same prompt involves the idea of other neighbors receiving the same message under their doors too. Therefore, being one of a kind, this segment receives the maximum Sui Generis score of 13.8, which we indicate with the red color here; lower scores will be indicated in blue.
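The lexical echoes mentioned above (for example, "bakery" appearing in 18% of samples) can be illustrated with a minimal Python sketch. This is only a toy word-level count, not the narrative-level detection method from the paper, which the transcript does not describe. The function name `echo_rate` is hypothetical, and the last two sample strings are invented for illustration; only the first two are quoted from the transcript.

```python
import re

def echo_rate(continuations, term):
    """Fraction of continuations that mention `term` (whole words, case-insensitive)."""
    if not continuations:
        return 0.0
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    hits = sum(1 for text in continuations if pattern.search(text))
    return hits / len(continuations)

# The first two strings are the continuations quoted in the transcript.
# The last two are hypothetical examples added for illustration only.
samples = [
    "Follow the street for two blocks, then take a left at the bakery "
    "and you'll see the station just ahead.",
    "Keep going straight for two blocks, then take a left "
    "and you'll see the station right ahead.",
    "Cross the square and the station is behind the town hall.",
    "Walk through the park; the station is on the far side.",
]

for term in ("bakery", "take a left", "two blocks"):
    print(f"{term!r}: {echo_rate(samples, term):.0%} of samples")
```

With real data, `samples` would hold the 100 sampled continuations, and a rate far above what varied human writing would show for the same detail is a hint of an echo.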
Our scoring mechanism is based on the technique we developed for detecting and counting essentially equivalent narrative elements. In the next segment, the writer emphasizes the puzzling nature of the situation by inserting a reflective pause: people who would not ordinarily talk break their usual silence to exchange thoughts. Again, this does not happen in GPT-4-generated stories.

The third segment, however, is anticipated. The writer's segment can be summarized as: other neighbors loudly discuss the situation, unsure why they feel compelled by the mystery. But a vigorous discussion among the neighbors is almost inevitable once the story has introduced the idea that the neighbors are all drawn out of their apartments into the hallway. In GPT-generated stories, a similar discussion happens 10% of the time. For example, one LLM continuation says that the residents begin to gather, discussing the event in [unclear] tones, and so on. Such discussions can happen earlier or later in GPT story continuations, whether generated from this point or from earlier points, even in those that start right after the segment delineated here with a dotted line. In our scores, we take into account both the frequency of the echoes and how much of the story prefix is needed to trigger them. Being foretold by the early part of the story, this segment receives a low Sui Generis score.

The writer's fourth segment adds a new, unanticipated twist. None of the GPT generations introduces a door that is still closed, with nobody having come out of it yet. At this point, before we reveal the writer's own ending, let's see examples of how GPT would finish the story. One LLM echo appears in 10% of continuations: a figure holds a duck-shaped object, then, somehow associated with this duck, a deafening sound occurs and some damage follows. This is one example of it, and this is another. In the first case, it's a brass whistle shaped like a duck. In the second case, it's a yellow rubber duck.
Either way, there is either a deafening quack or a loud whoosh. And then, as a result, the building begins to tremble and you have shattering glass or splintering wood. The writer's ending, however, is very different. The last door is still closed. You walk towards it and it opens. And what happens? A woman emerges from the room, grinning eerily, holding a paper that reads "Goose."

In summary, our analysis of human-authored stories versus those generated by large language models at various narrative stages reveals a consistent trend. Human Sui Generis scores are significantly higher, as generative AI seldom replicates human-written story elements. Human narratives also tend to be lengthier and consistently maintain elevated Sui Generis scores throughout, likely reflecting a deliberate effort to introduce novel content regularly and sustain reader engagement.

The gap between GPT-4 and human-level performance is considerable. Based on confidence intervals, there is less than a 5% chance that GPT-4 would achieve scores within the human range for any given segment of a story progression. The likelihood of matching human scores throughout all segments of a story is estimated to be less than 1 in 10 trillion. Therefore, simply using repeated sampling of full stories to reach human-level scores would require significant resources. For example, generating a 10-segment story at current GPT-4 pricing could cost around $5 billion, not accounting for the cost of identifying high-scoring stories among the many samples. And this is for Sui Generis scoring using 20 continuations per fragment. The LLM stories that look unique based on 20 continuations per fragment will start to look echoey if we increase the sampling to 100 continuations per fragment.

Generating the variety of stories produced by human writers likely requires additional approaches. One such approach would be to integrate large language model capabilities within a human-led collaborative workflow.
Large language models may serve as effective assistive tools and facilitate collaboration among writers. Imagine if our writing camp could quickly produce a polished piece of work created by multiple authors. Work is ongoing to develop such tools for narrative and other multimodal content generation.
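The compound-probability figure quoted in the transcript can be sanity-checked with a short Python sketch. It assumes that segments are matched independently and that a story has 10 segments. Both are illustrative assumptions; the transcript does not say how the estimate was derived. The function name `prob_all_segments` is hypothetical.

```python
def prob_all_segments(p_segment: float, n_segments: int) -> float:
    """Probability that every segment matches, assuming segments are independent."""
    if not 0.0 <= p_segment <= 1.0:
        raise ValueError("p_segment must be a probability in [0, 1]")
    if n_segments < 0:
        raise ValueError("n_segments must be non-negative")
    return p_segment ** n_segments

# 5% is the per-segment upper bound stated in the transcript.
# 10 segments is an assumed story length for illustration.
p = prob_all_segments(0.05, 10)

# Expected number of full-story samples before one matches, under the same assumptions.
expected_samples = 1 / p

print(f"probability of matching all segments: {p:.2e}")
print(f"expected samples needed: {expected_samples:.2e}")
```

Under these assumptions the probability falls just below 1 in 10 trillion, consistent with the bound quoted in the talk, which is why repeated sampling alone is an impractical route to human-level scores.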

Original Description

In our recent PNAS paper we demonstrate that large language models produce little variation in generated narratives. Compared to those generations, a human-written narrative is usually sui generis, i.e., one of a kind. Or, as we'd say in ML and statistics, human writing is in the tails of the distribution of the content LLMs generate. We introduced the Sui Generis (SG) score, which can be used to evaluate the distinctiveness of written text, whether it was written by a human or by a machine. The SG score may find uses both in model improvement and in assistive tools (e.g., helping you sound less like a GPT). As LLMs exhibit increasingly useful abilities to compare and refine ideas, and occasionally add to them, good writing in the future will likely still require a human-led, possibly collaborative effort, but one greatly assisted by AI. Read our paper: https://www.microsoft.com/en-us/research/?p=1148547&post_type=msr-research-item&preview=1&_ppp=4ebaff60a7

The video discusses the limitations of LLMs in generating creative narratives and explores human-led collaborative workflows in which LLMs assist writers. By understanding the strengths and weaknesses of LLMs, writers and developers can design more effective workflows and tools to augment human creativity.

Key Takeaways

Generate narratives with LLMs
Evaluate LLM performance
Fine-tune LLMs for specific tasks
Craft effective prompts for LLMs
Integrate LLMs with human-led workflows
Design advanced prompts for LLMs
Optimize LLM performance with retrieval augmented generation

💡 The gap between LLM performance and human-level performance is considerable, and LLM stories that look unique at 20 continuations per fragment start to look echoey when sampling is increased to 100 continuations per fragment.
