Reviewing LLMs for content creation

Harshit Tyagi · Beginner ·🧠 Large Language Models ·2y ago

Key Takeaways

The video reviews four LLMs for content creation, evaluating them on 5 use cases and 22 categories. Responses are scored on 5 criteria with partial marking, with GPT-4 Turbo as a judge alongside the author's own scores, and Llama 3 70B emerges as the winner.

Full Transcript

Hey everyone, what's up? I was working on a project that required me to figure out which large language model is best for content creation. Like anyone else, I went to the LMSYS leaderboard and looked at a few top models, like GPT-4 Turbo, Claude 3 Opus, and Gemini; all of these are the top models right now. I studied their model cards, what the companies say about them, where they shine, and how they perform on well-established benchmarks, and I compared them on cost and so on. I figured out that I need a cheaper alternative: I probably cannot go with GPT-4 Turbo because it will cost me more, and I cannot go with Claude 3 Opus because that is also very expensive. So I picked a handful of models and ended up evaluating all of them for my use case, content creation.

The four models I wanted to evaluate were Llama 3 70B, Mixtral 8x7B, Gemini 1.5 Pro, and Claude 3 Sonnet. I know that GPT-4 Turbo and Claude 3 Opus do really well, but there were other constraints, like cost, whether I can fine-tune them, and how easily they are available. So these are the four models I wanted to evaluate for content creation.

Content creation in itself is huge, so I broke it down into five use cases: blog writing, email writing, content summarization, script writing, and copywriting. Each of these use cases can be broken down further according to the process people follow. For example, in blog writing you first create an outline, then you research and refine your outline, then you create version one, an edited version two, a final draft as version three, and so on. So I've broken each use case down into process steps or types of categories. For example, copywriting
includes SEO copywriting, website copywriting, and advertisement copywriting. So the five use cases are extended into categories, and in total there are 22 categories that I've created. I'm going to evaluate the four models on all 22 categories, and to do that I had to create 22 creation prompts.

The other challenge was how to evaluate: what is my evaluation framework, and how did I evaluate the responses? The evaluation is done in two parts. The first part is done by GPT-4 Turbo, which is the best; I'm using ChatGPT, which by default uses GPT-4 Turbo, so GPT-4 Turbo is the judge. I know it might sound weird to use an LLM to evaluate another LLM's response, but MT-Bench and many other benchmarks use LLMs as their judge. So the first part of the evaluation is done by GPT-4 Turbo and the second part is done by me. The final score is the average of the two scores.

I have five criteria for each response, and each criterion carries two points. There can be partial marking, depending on how well a particular criterion is met: is the response nailing the complete criterion, which is defined in one or two lines? If it nails it, it gets two points; if it partially nails it, it gets 1 or 1.5, depending on how much it has covered; and if the criterion is not met, it gets zero. So in total each evaluation is marked out of 10. GPT scores out of 10, and I also score out of 10 based on the quality and on what I expected in that blog, that email, that copy. The final score comes out as the average of the two scores. So this is what I did for the evaluation
framework: pretty simple, pretty straightforward. So let's see how these models perform.

First of all, blog writing. For Llama 3 70B, look at the GPT scores: a perfect 10 on 10 on all of them. In my scores, I felt the outline was pretty great, and in research, Llama 3's capability to learn from reference text is great. Version one and version two could have been a bit better, I think, so I scored them nine; I felt they were a bit verbose, but everything else was perfect. The attention to detail of Llama 3 70B is great, and it is more nuanced compared to previous versions. Mixtral 8x7B, as I said, is not quite there. Claude 3 did really well, and Gemini 1.5 Pro also did really well on outline creation and the first, second, and third versions. Overall, if you look at the scores, Llama 3 70B scored 48.5 out of 50 and was the winner for blog writing.

In email writing, Llama 3 70B again did really well. This category itself was a bit disappointing as per modern practices and what I really wanted to see in the emails, and to some extent I didn't do a great job with the prompt either. But all the prompts are the same for all the models, so the comparison is still valid. Here I see Llama 3 70B standing out again; it scored 41.5 out of 50. A close next was Sonnet, which also did really well, and I think Mixtral and Gemini were also quite close. So overall Llama 3 70B outperformed all the others: its responses were concise, while all the others were verbose, redundant, and too long. That was the case with email writing.

Coming to copywriting: I felt this was something all the models did really well on. On the GPT scores, Llama 3 again outperformed, with only
the SEO copy as the case where it scored 9 out of 10; all the other copies were 10 on 10, and I felt the same. These were very good copies written by Llama 3 70B. Gemini also did really well and stood out when it comes to copywriting. So the verdict for copywriting: if you need a model that does well on copywriting, I would say pick Llama 3 70B, whatever your use case is, whether you want to fine-tune it, use it for RAG, and so on. Personally I would pick Llama 3 70B, and Gemini 1.5 Pro is a close second.

Then comes script writing. This was another use case where all the models did fairly well, but Llama 3 70B again outperformed the other models. I see the attention to detail and how it structures the entire script, be it a television script, a commercial script, a corporate script, or the screenplays that I asked Llama 3 to write. The response is really good. I think the first draft itself gets you 30 to 40% of the way there; about 60% of the work is left, where manual effort and manual curation are going to be required. But it does a pretty good job as per the instructions provided in the prompt. So here again Llama 3 70B topped the charts for script writing, in almost all the categories, as you can see. Winner for script writing: Llama 3 70B.

Now let's come to the last use case, content summarization. Here I had only two categories: essay summarization and research paper summarization, as per the use case that I had. I have seen that the Claude 3 family does really well with reference text, whenever you provide something to look up, so
they do really well, and Claude 3 Sonnet performed really well on summarization. I also checked Claude 3 Opus for this particular task, and I was surprised to see the output: it is just too good. The way it structured the entire summary of the Bitcoin paper is pretty close to what I would personally give someone who wants to understand the entire Bitcoin paper and quickly go through it in a few minutes. That structure was only provided by Claude 3; none of the other models provided this sort of structure and the logic around it. So Claude 3 Opus did really well, and Sonnet was a close second. When it comes to the final results here, Claude 3 Sonnet and Gemini 1.5 Pro both did really well. Surprisingly, Llama 3 70B did not do really well on the summarization task. That was one observation I had about Llama 3 70B: I'm not sure how well it does on summarization tasks.

Overall, the final scores look something like this. The winner of this evaluation, the best model for content creation, is Llama 3 70B with a total score of 199.00. Then we have Gemini 1.5 Pro, the first runner-up with a score of 194, which does really well on summarization and script writing tasks. The second runner-up, Claude 3 Sonnet, scored [...] that I had.

Again, if you think you have a better model that would do really well on any of these categories or use cases, do let me know in the comments below. And if you liked the video, please tell me in the comments if I should create such comparisons and reviews of other frameworks, architectures, and cloud platforms. There will be other tutorials and videos that will soon show up in your feed, as promised earlier. I'll see you in the next one. Until then, keep learning, keep building.
[Music]
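
The scoring scheme described in the transcript (five criteria worth up to two points each, partial marks allowed, and a final score that averages the judge's score with the reviewer's score) can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: the function names are hypothetical, and the example numbers simply mirror the blog-writing case where the judge gave 10 and the reviewer gave 9.

```python
from statistics import mean

MAX_POINTS_PER_CRITERION = 2.0  # each criterion carries two points
NUM_CRITERIA = 5                # five criteria per response


def rubric_score(criterion_points):
    """Sum per-criterion points (0 to 2 each, partial marks allowed) into a score out of 10."""
    if len(criterion_points) != NUM_CRITERIA:
        raise ValueError(f"expected {NUM_CRITERIA} criteria, got {len(criterion_points)}")
    for points in criterion_points:
        if not 0.0 <= points <= MAX_POINTS_PER_CRITERION:
            raise ValueError(f"points must be between 0 and 2, got {points}")
    return sum(criterion_points)


def final_score(judge_score, reviewer_score):
    """Average the LLM judge's score and the human reviewer's score (both out of 10)."""
    return mean([judge_score, reviewer_score])


# Hypothetical example: the judge awards full marks on every criterion,
# and the reviewer gives 9 out of 10 because the draft felt verbose.
judge = rubric_score([2, 2, 2, 2, 2])
reviewer = 9
print(final_score(judge, reviewer))
```

Averaging the two scores keeps the LLM judge from being the only signal, which is presumably why the reviewer's own impression is weighted equally here.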

Original Description

A side project. Find more details here: https://open.substack.com/pub/dswharshit/p/and-the-best-llm-for-content-creation?r=b7dzq&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
Learning resource repo, don’t forget to star the repo: https://github.com/dswh/ai-engineer-roadmap
Newsletter:
Follow me for more AI & AI Engineering content:
- LinkedIn: https://www.linkedin.com/in/tyagiharshit/
- X / Twitter: https://twitter.com/dswharshit
- Join the Discord community for ideas, discussion, and more: https://discord.gg/rssxJV2Xkz


This video shows how to evaluate LLMs for content creation across 5 use cases and 22 categories, and reports the strengths and weaknesses of the four models tested. It matters because choosing a model for content creation depends on how it performs on your own tasks, not only on public benchmarks. The evaluation framework combines an LLM judge (GPT-4 Turbo) with the author's own scores.

Key Takeaways
  1. Evaluate LLMs on 5 use cases: blog writing, email writing, content summarization, script writing, and copywriting
  2. Break the use cases down into 22 categories in total to assess LLM performance
  3. Use GPT-4 Turbo as a judge to evaluate LLM responses
  4. Score each response on 5 criteria worth 2 points each, with partial marking
  5. Fine-tune LLMs for specific tasks to improve performance
  6. Explore LLM applications beyond text for multimodal content creation
💡 Llama 3 70B outperforms other LLMs in content creation tasks, particularly in blog writing, email writing, and script writing, but struggles with content summarization tasks.
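
To put the reported totals in context, the sketch below converts them to percentages. It rests on an assumption the video does not state outright: that each of the 22 categories is scored out of 10, so the maximum total is 220. The per-use-case figures (48.5 and 41.5 out of 50) and the two totals (199.00 and 194) are the ones quoted in the transcript; the helper name is hypothetical.

```python
POINTS_PER_CATEGORY = 10   # each category's final score is out of 10
NUM_CATEGORIES = 22        # categories across all five use cases
MAX_TOTAL = POINTS_PER_CATEGORY * NUM_CATEGORIES  # assumed maximum: 220


def as_percentage(score, maximum):
    """Express a score as a percentage of its maximum, rounded to one decimal."""
    return round(100 * score / maximum, 1)


# Totals quoted in the transcript (Claude 3 Sonnet's total is not given).
reported_totals = {"Llama 3 70B": 199.0, "Gemini 1.5 Pro": 194.0}
for model, total in reported_totals.items():
    print(f"{model}: {total} / {MAX_TOTAL} = {as_percentage(total, MAX_TOTAL)}%")

# Llama 3 70B's per-use-case scores quoted in the transcript, each out of 50.
print("Blog writing:", as_percentage(48.5, 50), "%")
print("Email writing:", as_percentage(41.5, 50), "%")
```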
