Reviewing LLMs for content creation
Key Takeaways
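The video reviews four LLMs (Llama 3 70B, Mixtral 8x7B, Gemini 1.5 Pro, and Claude 3 Sonnet) for content creation across 5 use cases and 22 categories. Each response is scored out of 10 on 5 criteria with partial marking, once by GPT-4 Turbo acting as a judge and once by the author, and the two scores are averaged. Llama 3 70B emerges as the winner.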
Full Transcript
Hey everyone, what's up? So I was working on this project that required me to figure out which is the best large language model for content creation. Now, just like any other person, I went onto the LMSYS leaderboard and looked at a few top models, like GPT-4 Turbo, Claude 3 Opus, and Gemini; all of these are the top models right now. I studied their model cards: what companies are saying about these models, where they shine, and how they're performing on the well-established benchmarks. I compared them and looked at their cost, et cetera, and I figured out that I need a cheaper alternative. I probably cannot go with GPT-4 Turbo because that'll cost me more, and I cannot go with Claude 3 Opus because that is also very expensive. So I picked up a bunch of models and ended up evaluating all of them for my use case, content creation.
The four models that I wanted to evaluate were Llama 3 70B, Mixtral 8x7B, Gemini 1.5 Pro, and Claude 3 Sonnet. Now, I know that GPT-4 Turbo does really well and Claude 3 Opus does really well, but there were other constraints, like cost, how I can fine-tune them, how easily they are available, et cetera. So these are the four models that I wanted to evaluate for the task of content creation.
Now, content creation in itself is huge, so I broke it down into five different use cases: blog writing, email writing, content summarization, script writing, and copywriting. All of these use cases can be further broken down with respect to the process that people follow. For example, in blog writing, first you create an outline, then you research, you refine your outline and your research, then you create version one, an edited version two, a final draft version three, and so on. So I've broken down each of these use cases into process steps or different types of categories. For example, copywriting
includes, you know, SEO copywriting, website copywriting, and advertisement copywriting. So all of these five use cases are further extended into different categories, and there are 22 categories in total. I'm going to be evaluating these four models across all 22 categories, and to do that I had to create 22 creation prompts.
The other challenge was how to evaluate: what is my evaluation framework, and how did I go about evaluating the responses? The evaluation is done in two parts. The first part is done by GPT-4 Turbo, which is the best; I'm going to be using ChatGPT, which by default uses GPT-4 Turbo, so GPT-4 Turbo is the judge. I know it might sound weird that you're using an LLM to evaluate another LLM's response, but if you look at MT-Bench, there are many other benchmarks which use LLMs as their judge. So the first part of the evaluation is done by GPT-4 Turbo and the second part is done by me. The final score is the average of the two scores.
Now, I have five criteria to meet for each response, and each criterion is worth two points. There can be partial marking depending on how well a particular criterion is met. Is the response nailing the complete criterion, which is defined in one or two lines? If it nails it, you get two points; if it partially nails it, you get 1 or 1.5, depending on how much you've covered; and if the criterion is not met, you get zero. So in total, each evaluation is marked out of 10. GPT scores out of 10, and I also score out of 10 based on the quality and on what I expected in that blog, that email, that copy. The final score then comes out as the average of the two scores. So this is what I did for the evaluation
framework: pretty simple, pretty straightforward. So let's see how these models perform.
First of all, blog writing. If you look at Llama 3 70B in blog writing, the GPT scores are a perfect 10 on 10 across the board. On my scores, I felt the outline was pretty great, and in research, Llama 3's capability to learn from reference text is great. Version one and version two could have been a bit better, I think, so I scored them nine; I felt they were a bit verbose, but everything else was perfect. The attention to detail of Llama 3 70B is great, and it's fairly more nuanced compared to previous versions. Mixtral 8x7B, as I said, is not quite there. Claude 3 did really great, and Gemini 1.5 Pro also did really well when it comes to outline creation, first version, second version, and third version. Overall, if you look at the scores, Llama 3 70B scored 48.5 out of 50 and was the winner for blog writing.
Then, in email writing, Llama 3 70B again did really well. This category itself was a bit disappointing as per modern practices and what I really wanted to see in the emails, and to some extent I would say that I didn't do a great job with the prompt either. But all the prompts are the same for all the models, so the comparison is still valid. Here I see Llama 3 70B again standing out: it scored 41.5 out of 50, and a close next was Sonnet, which also did really well. I think Mixtral and Gemini were also quite close. So overall, Llama 3 70B outperformed all the others; its responses were concise, while all the others were verbose, redundant, and too long. So that was the case with email writing.
Coming to copywriting: I felt copywriting was something that all the models did really well on. On the GPT scores, Llama 3 again outperformed:
the SEO copy was the only case where it scored 9 out of 10, and all the other copies were 10 on 10. I also felt the same. These were very good copies written by Llama 3 70B. Gemini also did really great and stood out when it comes to copy creation. So the verdict for copywriting: if you need a model that does well on copywriting, I would say pick Llama 3 70B, whatever your use case is, whether you want to fine-tune it, use it for RAG, et cetera. I would personally pick Llama 3 70B, and Gemini 1.5 Pro is a close second.
Then comes script writing. Script writing was another use case where all the models did fairly well, but Llama 3 70B again outperformed the other models. I see the attention to detail and how it structures the entire script, be it a television script, a commercial script, a corporate script, or the screenplays that I asked Llama 3 to write. When you see the response, I think it's really, really good. The first draft itself gets you about 30 to 40% of the way there; something like 60% of the work is left, where manual effort and manual curation are going to be required. But I think it does a pretty good job as per the instructions provided in the prompt. So here again Llama 3 70B topped the charts for script writing, in almost all the different categories, as you can see. So the winner for script writing is Llama 3 70B.
Now let's come to the last use case, content summarization. For content summarization I had only two categories: essay summarization and research paper summarization, as per the use case that I had. I have seen that the Claude 3 family does really well with reference text, whenever you provide something to look up, so
they do really, really well, and Claude 3 Sonnet performed really well on summarization. I also checked Claude 3 Opus for this particular task, and I was surprised to see the output; it is just too good. The way it structured the entire summarization of the Bitcoin paper is pretty close to what I would personally give someone if they wanted to understand the entire Bitcoin paper and quickly go through it in a few minutes. That structure was only provided by Claude 3; none of the other models provided this sort of structure and the logic around it. So Claude 3 Opus did really well and Sonnet was a close second. When it comes to the final results here, Claude 3 Sonnet and Gemini 1.5 Pro both did really well. Surprisingly, Llama 3 70B did not do really well on the summarization task, so that was one observation I had about Llama 3 70B; I'm not sure how well it does on summarization tasks.
Now, overall, the final scores look something like this. The winner of this evaluation, the best model for content creation, is Llama 3 70B with a total score of 199. Then we have Gemini 1.5 Pro, the first runner-up, with a score of 194, which does really well on summarization and script writing tasks. The second runner-up is Claude 3 Sonnet.
But again, if you think you have a better model which would do really well on any of these categories or use cases, do let me know in the comments below. And if you liked the video, please tell me in the comments if I should create such comparisons and reviews of other frameworks, architectures, and cloud platforms. There will be other tutorials and videos that'll soon show up in your feed, as promised earlier. Okay, I'll see you in the next one. Until then, keep learning, keep building.
[Music]
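The scoring scheme described in the transcript above (five criteria worth two points each with partial marks, a judge score and an author score out of 10, and a final score that averages the two) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's actual tooling: the function names are hypothetical, and the marks in the example are made up rather than taken from the video.

```python
from statistics import mean

NUM_CRITERIA = 5         # five criteria per response
MAX_PER_CRITERION = 2.0  # each criterion is worth two points, so 10 in total


def rubric_score(marks):
    """Sum per-criterion marks (0 to 2 each, partial marks such as 1 or 1.5 allowed)."""
    if len(marks) != NUM_CRITERIA:
        raise ValueError(f"expected {NUM_CRITERIA} marks, got {len(marks)}")
    for mark in marks:
        if not 0 <= mark <= MAX_PER_CRITERION:
            raise ValueError(f"mark {mark} is outside 0..{MAX_PER_CRITERION}")
    return float(sum(marks))


def final_score(judge_score, human_score):
    """Average the LLM judge's score and the human reviewer's score (both out of 10)."""
    return mean([judge_score, human_score])


def use_case_total(category_scores):
    """Add up the final scores of every category in one use case."""
    return sum(category_scores)


# Hypothetical marks for a single response (not figures from the video).
judge = rubric_score([2, 2, 1.5, 2, 1])  # 8.5
human = 9.0                              # reviewer's overall score out of 10
print(final_score(judge, human))         # 8.75
```

A use case with five categories would then be scored out of 50 by passing its five final scores to `use_case_total`, which matches the "out of 50" totals quoted for blog writing and email writing.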
Original Description
A side project.
Find more details here: https://open.substack.com/pub/dswharshit/p/and-the-best-llm-for-content-creation?r=b7dzq&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
Learning resource repo, don’t forget to star the repo: https://github.com/dswh/ai-engineer-roadmap
Newsletter:
Follow me for more AI & AI Engineering content:
- LinkedIn: https://www.linkedin.com/in/tyagiharshit/
- X / Twitter: https://twitter.com/dswharshit
- Join the Discord community for ideas, discussion, and more: https://discord.gg/rssxJV2Xkz