Reviewing LLMs for content creation

Harshit Tyagi · Beginner ·🧠 Large Language Models ·2y ago

Skills: LLM Foundations 90% · Prompt Craft 80% · Fine-tuning LLMs 70% · Multimodal LLMs 60% · Prompting Basics 50%

Key Takeaways

The video reviews 4 LLMs for content creation, evaluating their performance on 5 use cases split into 22 categories. Each response is scored against 5 criteria with partial marking, by GPT-4 Turbo as a judge and by the author, and the two scores are averaged. Llama 3 70B emerges as the winner.

Full Transcript

Hey everyone, what's up? I was working on a project that required me to figure out which is the best large language model for content creation. Just like any other person, I went to the LMSYS leaderboard and looked at a few top models: GPT-4 Turbo, Claude 3 Opus, Gemini. All of these are the top models right now. I studied their model cards, what the companies are saying about these models, where they shine, and how they perform on the well-established benchmarks, and I compared them and looked at their cost and so on. I figured out that I need a cheaper alternative. I probably cannot go with GPT-4 Turbo because that will cost me more, and I cannot go with Claude 3 Opus because that is also very expensive. So I picked a bunch of models and ended up evaluating all of them for my use case, content creation.

The four models that I wanted to evaluate were Llama 3 70B, Mixtral 8x7B, Gemini 1.5 Pro, and Claude 3 Sonnet. I know that GPT-4 Turbo does really well and Claude 3 Opus does really well, but there were other constraints like cost, how I can fine-tune them, how easily they are available, and so on. So these are the four models that I wanted to evaluate for the task of content creation.

Now, content creation in itself is huge, so I broke it down into five different use cases: blog writing, email writing, content summarization, script writing, and copywriting. Each of these use cases can be broken down further with respect to the process that people follow. For example, in blog writing you first create an outline, then you research, then you refine your outline and your research, then you create version one, an edited version two, a final draft as version three, and so on. So I've broken each of these use cases down into process steps or different types of categories. For example, copywriting could be SEO copywriting, website copywriting, or advertisement copywriting. All five use cases are extended into different categories, and in total there are 22 categories that I've created. I'm going to be evaluating these four models across all 22 categories.

To evaluate these four models across the 22 categories, I had to create 22 creation prompts. The other challenge was how to evaluate: what is my evaluation framework, and how did I go about evaluating the responses? The evaluation is done in two parts. The first part is done by GPT-4 Turbo, which is the best, so I'm going to be using ChatGPT, which by default uses GPT-4 Turbo, as a judge. I know it might sound weird that you're using an LLM to evaluate another LLM's response, but if you look at MT-Bench, and there are many other benchmarks too, they use LLMs as their judge. So the first part of the evaluation is done by GPT-4 Turbo and the second part is done by me. The final score is the average of the two scores.

Since I have five criteria to meet for each response, each criterion is worth two points. There can be partial marking depending on how well that particular criterion is met: is the response nailing the complete criterion, which is defined in one or two lines? If it nails it, you get two points. If it partially nails it, you get 1 or 1.5 depending on how much you've covered. And if that particular criterion is not met, you get zero. So in total, each evaluation is marked out of 10. GPT is going to score out of 10, and I am also going to score out of 10 based on the quality and on what I expected in that blog, that email, or that copy. The final score then comes out as the average of the two scores. So this is what I did for the evaluation framework. Pretty simple, pretty straightforward.

So let's see how these models perform. First of all, blog writing. Look at Llama 3 70B: the GPT scores are a perfect 10 on 10 across the board. On my scores, I felt that the outline was pretty great. On research, Llama 3's capability to learn from reference text is great. Version one and version two, I think, could have been a bit better, so I scored them nine; they were a bit verbose, I felt, but everything else was perfect. The attention to detail of Llama 3 70B is great, and it's fairly more nuanced compared to previous versions. Mixtral 8x7B, as I said, is not quite there. Claude 3 did really well. Gemini 1.5 Pro also did really well when it comes to outline creation, first version, second version, and third version. Overall, if you look at the scores, Llama 3 70B scored 48.5 out of 50 and was the winner for blog writing.

Then, in email writing, again Llama 3 70B did really well. This category itself was a bit disappointing measured against modern practices and what I really wanted to see in the emails, and to some extent I would say that I didn't do a great job with the prompt either. But all the prompts are the same for all the models, so the comparison is still valid. Here I see Llama 3 70B again standing out; it scored 41.5 out of 50. A close next was Sonnet, which also did really well, and I think Mixtral and Gemini were also quite close. So overall Llama 3 70B outperformed all the others. The responses were concise with Llama 3; all the others were verbose, redundant, and too long. So that was the case with email writing.

Coming to copywriting. Copywriting, I felt, was something that all the models did really well on. On the GPT scores, Llama 3 again outperformed: the SEO copy was the only case where it scored 9 out of 10, all the other copies were 10 on 10, and I felt the same. These were pretty close, very good copies written by Llama 3 70B. Gemini also did really well and stood out when it comes to copy creation. And finally, the verdict for copywriting: if you need a model that does well on copywriting, I would say pick Llama 3 70B, whatever your use case is, whether you want to fine-tune it, use it for RAG, and so on. I would personally pick Llama 3 70B, and Gemini 1.5 Pro is a close second.

Then comes script writing. Script writing was another use case where all the models did fairly well, but Llama 3 70B again outperformed the other models. I see the attention to detail and how it structures the entire script, be it your television script, your commercial script, your corporate script, or the screenplays that I asked Llama 3 to write. You see the response; I think it's really good. The first draft itself gets you about 30 to 40% of the way there; about 60% of the work is left, where manual effort, manual curation, and all of those things are going to be required. But I think it does a pretty good job as per the instructions provided in the prompt. So here again Llama 3 70B tops the charts for script writing, in almost all the different categories, as you can see. So the winner for script writing is Llama 3 70B.

Now let's come to the last use case, content summarization. Here I had only two categories, essay summarization and research paper summarization, as per the use case that I had. I have seen that the Claude 3 family does really well with reference text, whenever you provide something to look up, and Claude 3 Sonnet performed really well on summarization. I also checked Claude 3 Opus for this particular task, and I was surprised to see the output. It is just too good. The way it structured the entire summarization of the Bitcoin paper is pretty close to what I would personally give someone who wants to understand the entire Bitcoin paper and quickly go through it in a few minutes. That structure was only provided by Claude 3; none of the other models provided this sort of structure and the logic around it. So Claude 3 Opus did really well, and Sonnet was a close second. When it comes to the final results here, Claude 3 Sonnet and Gemini 1.5 Pro both did really well. Surprisingly, Llama 3 70B did not do really well on the summarization task. That was one observation I had about Llama 3 70B; I'm not sure how well it does on summarization tasks.

Now, overall, the final scores look something like this. The winner of this evaluation, the best model for content creation, is Llama 3 70B with a total score of 199.00. Then we have Gemini 1.5 Pro, the first runner-up with a score of 194, which does really well on summarization and script writing tasks. The second runner-up, Claude 3 Sonnet, scored [...] that I had.

But again, if you think that you have a better model which would do really well on any of these categories or use cases, do let me know in the comments below. And if you liked the video, please tell me in the comments if I should create such comparisons and reviews of other frameworks, different architectures, and different cloud platforms. There will be other tutorials and other videos that will soon show up in your feed, as promised earlier. I'll see you in the next one. Until then, keep learning, keep building.
[Music]
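
The scoring scheme described in the transcript (five criteria worth two points each, partial marks of 1 or 1.5, and a final score that averages the judge's and the author's marks out of 10) can be sketched in Python as follows. This is only an illustrative sketch: the video shows no code, and the function names `rubric_score`, `final_score`, and `use_case_total` are hypothetical, as are the numbers in the usage lines.

```python
# Hypothetical helper names; the marks follow the scheme described in the video.
ALLOWED_MARKS = (0, 1, 1.5, 2)   # not met, partially met (1 or 1.5), fully met
CRITERIA_PER_RESPONSE = 5        # five criteria, two points each -> out of 10


def rubric_score(marks):
    """Sum the five per-criterion marks into a score out of 10."""
    if len(marks) != CRITERIA_PER_RESPONSE:
        raise ValueError("expected exactly five criterion marks")
    for mark in marks:
        if mark not in ALLOWED_MARKS:
            raise ValueError(f"unsupported mark: {mark!r}")
    return float(sum(marks))


def final_score(judge_score, human_score):
    """Average the judge's score and the human reviewer's score (both out of 10)."""
    for score in (judge_score, human_score):
        if not 0 <= score <= 10:
            raise ValueError("scores must lie between 0 and 10")
    return (judge_score + human_score) / 2


def use_case_total(category_scores):
    """Add up the final scores of every category in one use case."""
    return float(sum(category_scores))


if __name__ == "__main__":
    # Made-up marks for a single response, not figures from the video.
    judge = rubric_score([2, 2, 2, 2, 2])        # 10.0
    human = rubric_score([2, 2, 2, 1.5, 1.5])    # 9.0
    print(final_score(judge, human))             # 9.5
    print(use_case_total([10.0, 9.5, 9.0]))      # 28.5
```

Under this scheme a use case with five categories is marked out of 50, which matches the "out of 50" totals quoted for blog writing and email writing. Assuming every category is scored the same way, 22 categories give a maximum overall total of 220.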

Original Description

A side project. Find more details here: https://open.substack.com/pub/dswharshit/p/and-the-best-llm-for-content-creation?r=b7dzq&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
Learning resource repo, don’t forget to star the repo: https://github.com/dswh/ai-engineer-roadmap
Newsletter:
Follow me for more AI & AI Engineering content:
- LinkedIn: https://www.linkedin.com/in/tyagiharshit/
- X / Twitter: https://twitter.com/dswharshit
- Join the Discord community for ideas, discussion, and more: https://discord.gg/rssxJV2Xkz


This video teaches how to evaluate LLMs for content creation, covering 5 use cases and 22 categories, and provides insights into the strengths and weaknesses of different LLMs. It matters because effective LLM evaluation is crucial for content creation applications. The video provides a comprehensive framework for evaluating LLMs and offers practical advice on how to use LLMs for content creation.

Key Takeaways

Evaluate LLMs on 5 use cases: blog writing, email writing, content summarization, script writing, and copywriting
Break the use cases down into 22 categories in total to assess LLM performance
Use GPT-4 Turbo as a judge to evaluate LLM responses, and average its score with a human reviewer's score
Assess LLM responses based on 5 criteria with partial marking
Fine-tune LLMs for specific tasks to improve performance
Explore LLM applications beyond text for multimodal content creation
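
The LLM-as-judge step in the list above could be automated along the following lines. This is a hedged sketch rather than the author's method: in the video the judging is done interactively in ChatGPT, so the prompt wording, the JSON reply format, and every name here are assumptions made for illustration. In particular, `ask_judge` stands for any function you supply that sends a prompt to a judge model and returns its text reply; no specific vendor API is implied.

```python
import json

ALLOWED_MARKS = (0, 1, 1.5, 2)  # same marking scheme as described in the video


def build_judge_prompt(task_prompt, response, criteria):
    """Assemble a grading prompt for the judge model (wording is illustrative)."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, start=1))
    return (
        "You are grading a piece of generated content.\n\n"
        f"Task given to the model:\n{task_prompt}\n\n"
        f"Response to grade:\n{response}\n\n"
        f"Criteria (award 0, 1, 1.5, or 2 points for each):\n{numbered}\n\n"
        'Reply with JSON only, for example {"marks": [2, 1.5, 2, 1, 0]}.'
    )


def parse_judge_reply(reply, n_criteria):
    """Validate the judge's JSON reply and return the total score."""
    marks = json.loads(reply)["marks"]
    if len(marks) != n_criteria:
        raise ValueError("judge returned the wrong number of marks")
    if any(mark not in ALLOWED_MARKS for mark in marks):
        raise ValueError("judge returned a mark outside the rubric")
    return float(sum(marks))


def judge_response(ask_judge, task_prompt, response, criteria):
    """Grade one response; ask_judge is a caller-supplied prompt -> reply function."""
    reply = ask_judge(build_judge_prompt(task_prompt, response, criteria))
    return parse_judge_reply(reply, len(criteria))


if __name__ == "__main__":
    # A stand-in judge with a canned reply, so the sketch runs without a model.
    def fake_judge(prompt):
        return '{"marks": [2, 2, 1.5, 2, 1]}'

    criteria = ["criterion A", "criterion B", "criterion C",
                "criterion D", "criterion E"]
    print(judge_response(fake_judge, "Write a blog outline.", "(model output)", criteria))
```

Passing the judge in as a function keeps the rubric logic testable without network access, and validating the reply guards against a judge that ignores the requested format.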

💡 Llama 3 70B outperforms the other LLMs in content creation tasks, particularly in blog writing, email writing, and script writing, but struggles with content summarization tasks.

